# South-African-language-identification-hack-2022

© Explore Data Science Academy

---
### Honour Code

I {**Olukayode Oloyede**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [157]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np                     
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
# Customise our plotting settings

sns.set_style('whitegrid')

#Libraries for data cleaning and preprocessing

from wordcloud import WordCloud, STOPWORDS , ImageColorGenerator
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import resample
import string
import re
import pickle
import nltk

#Libraries for data preparation and model building

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
# Classification report

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [2]:
#Loading the df_train and df_test data
df_train= pd.read_csv("train_set.csv")
df_test= pd.read_csv("test_set.csv")

In [131]:
#keeping an unmodified copy of the test set; its 'index' column is used for the submission files
df_test11= pd.read_csv("test_set.csv")

In [3]:
df_train.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [4]:
df_test.head()

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini...
4,5,Winste op buitelandse valuta.


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [5]:
#checking the first 5 rows of the data
df_train.head()




Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...


After taking a look at the first five rows of the DataFrame, we can see that it has two columns.

We have one feature and one label.

The feature is:

- text

The label is:

- lang_id

The test DataFrame contains only the feature.

In [13]:
df_test = df_test.set_index('index')

In [16]:
#dropping the leftover 'level_0' column from the test data
df_test= df_test.drop('level_0', axis =1)

In [17]:
df_test.head()

index,text
1,"Mmasepala, fa maemo a a kgethegileng a letlele..."
2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
3,Tshivhumbeo tshi fana na ngano dza vhathu.
4,Kube inja nelikati betingevakala kutsi titsini...
5,Winste op buitelandse valuta.


We will take a look at the shape of the DataFrame to see how much data we are working with, in rows and columns.

In [18]:
#checking the total number of rows and columns
df_train.shape


(33000, 2)

The training dataset has 33000 rows and 2 columns

Let's take a look at the data types in the DataFrame using `df_train.info()` to get more information about it.

In [19]:
#checking to see if null value exist in each column and the data type of each observation
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   lang_id  33000 non-null  object
 1   text     33000 non-null  object
dtypes: object(2)
memory usage: 515.8+ KB


The above output shows that both columns are of type object and that neither contains null values.

In [20]:
#checking for missing/null values
df_train.isnull().sum()

lang_id    0
text       0
dtype: int64

In [21]:
#checking for unique values 
df_train['lang_id'].value_counts()

xho    3000
eng    3000
nso    3000
ven    3000
tsn    3000
nbl    3000
zul    3000
ssw    3000
tso    3000
sot    3000
afr    3000
Name: lang_id, dtype: int64

The above output shows that the label in the training dataset has 11 unique values, each with 3000 rows.

Let's count and plot the distribution of each unique value.

In [22]:
#plotting the distribution of unique label values
f, ax = plt.subplots(figsize=(8, 4))
ax = sns.countplot(x="lang_id", data=df_train)
plt.show()
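
As a rough numeric check of the balance that the plot is meant to show, the sketch below rebuilds the counts from the `value_counts()` output above and converts them to shares. The names `counts`, `shares` and `is_balanced` are introduced here for illustration only.

```python
import pandas as pd

# Label counts copied from the value_counts() output above
counts = pd.Series({
    "xho": 3000, "eng": 3000, "nso": 3000, "ven": 3000, "tsn": 3000, "nbl": 3000,
    "zul": 3000, "ssw": 3000, "tso": 3000, "sot": 3000, "afr": 3000,
})

# Share of the training rows held by each language
shares = counts / counts.sum()

# True when every language has the same number of rows
is_balanced = counts.nunique() == 1

print(shares.round(4))
print("balanced:", is_balanced)
```

Because every language holds the same share of the rows, no resampling should be needed before modelling.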

Let's explore our feature to gain more insight.

In [24]:
#checking the text column to see if there are any duplicate entries
#df_train['text'].duplicated().sum()

In [25]:
#df_train['lang_id'].duplicated().sum()

In [26]:
#taking a closer look at the text column
df_train['text']

0        umgaqo-siseko wenza amalungiselelo kumaziko ax...
1        i-dha iya kuba nobulumko bokubeka umsebenzi na...
2        the province of kwazulu-natal department of tr...
3        o netefatša gore o ba file dilo ka moka tše le...
4        khomishini ya ndinganyiso ya mbeu yo ewa maana...
                               ...                        
32995    popo ya dipolateforomo tse ke go tlisa boetele...
32996    modise mosadi na o ntse o sa utlwe hore thaban...
32997    closing date for the submission of completed t...
32998    nawuphina umntu ofunyenwe enetyala phantsi kwa...
32999    mafapha a mang le ona a lokela ho etsa ditlale...
Name: text, Length: 33000, dtype: object

### 3. Data cleaning


Pre-processing text data

We will clean our data using the steps below.

* Remove punctuation
* Tokenization - converting a sentence into a list of words
* Remove stopwords - frequently occurring English words
* Lemmatization/stemming - transforming any form of a word to its root word
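
The first two steps can be sketched on a single sentence with the standard library alone. This is a simplified illustration, not the notebook's exact pipeline: `preprocess` is a hypothetical helper, tokenization is a plain whitespace split, and stop-word removal and lemmatization are left out because they need NLTK resources.

```python
import re
import string

def preprocess(text):
    """Hypothetical helper: strip punctuation and digits, then tokenize."""
    # Drop every character listed in string.punctuation
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Drop digits
    no_digits = re.sub(r"[0-9]+", "", no_punct)
    # Whitespace tokenization, lowercased, keeping tokens of 2+ characters
    return [tok.lower() for tok in no_digits.split() if len(tok) >= 2]

# Sample sentence in the style of the test set, with a number added
sample = "Winste op buitelandse valuta, 2022."
tokens = preprocess(sample)
print(tokens)
```

Returning a list of tokens keeps the helper reusable: the caller can join the tokens back into a string or pass them on to a lemmatizer.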

In [30]:
#replaces all URLs with the text 'url-web'
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'
df_train['text_no_url'] = df_train['text'].replace(to_replace = pattern_url, value = subs_url, regex = True)
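
To see what the pattern does on one string, the sketch below applies it with `re.sub`, which stands in here for the row-by-row substitution that `Series.replace(..., regex=True)` performs. The sample sentence and its URL are made up for illustration.

```python
import re

# Same pattern and placeholder as in the cell above
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'

# Made-up sentence; the URL is not taken from the dataset
sample = "see https://example.com/page?id=1 for details"
cleaned = re.sub(pattern_url, subs_url, sample)
print(cleaned)

# Text without a URL passes through unchanged
unchanged = re.sub(pattern_url, subs_url, "no links here")
```

The match stops at the first whitespace character, so the words around the URL are kept.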

In [31]:
df_train.head()

Unnamed: 0,lang_id,text,text_no_url
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqo-siseko wenza amalungiselelo kumaziko ax...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,i-dha iya kuba nobulumko bokubeka umsebenzi na...
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulu-natal department of tr...
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [27]:
# Remove Punctuation
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [28]:
#Data preprocessing
#function that handles the removal of punctuation from the text
def remove_punct(text):
    """
    Takes a text string as input and returns it with every character
    found in string.punctuation removed, and with all digits removed.
    
    """
    text  = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

In [33]:
#apply the remove_punct function to the text_no_url column
df_train['text_no_url'] = df_train['text_no_url'].apply(lambda x: remove_punct(x))
df_train.head()

Unnamed: 0,lang_id,text,text_no_url
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqosiseko wenza amalungiselelo kumaziko axh...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,idha iya kuba nobulumko bokubeka umsebenzi nap...
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulunatal department of tra...
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...


In [34]:
# Data cleaning 
def clean_data(texts):
    
    """
    clean_data(texts) further cleans the data using regular expressions (re)
    by removing extra white space and non-text characters, lowercasing each
    word and dropping words shorter than two characters
    
    """
    words = list()
    for text in texts.split():
        # remove non-text characters from the start and end of the string
        text = re.sub(r'(^\W+|\W+$)','',text)
        # remove any remaining white space
        text = re.sub(r'\s+','',text)
        # remove non-text characters and emojis inside the word
        text = re.sub(r'\W+',r'',text)
        # remove unwanted symbols
        text = re.sub(r'[#,@,$_,?*//""]',r'',text)
        words.append(text.lower())

    # keep words of at least two characters; built after the loop so that
    # an empty input returns an empty string instead of raising an error
    words = [i for i in words if len(i) >= 2]

    return " ".join(words)

In [35]:
#applying the clean_data function
df_train['text_clean'] = df_train['text_no_url'].apply(clean_data)

In [36]:
df_train.head()

Unnamed: 0,lang_id,text,text_no_url,text_clean
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko ax...,umgaqosiseko wenza amalungiselelo kumaziko axh...,umgaqosiseko wenza amalungiselelo kumaziko axh...
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi na...,idha iya kuba nobulumko bokubeka umsebenzi nap...,idha iya kuba nobulumko bokubeka umsebenzi nap...
2,eng,the province of kwazulu-natal department of tr...,the province of kwazulunatal department of tra...,the province of kwazulunatal department of tra...
3,nso,o netefatša gore o ba file dilo ka moka tše le...,o netefatša gore o ba file dilo ka moka tše le...,netefatša gore ba file dilo ka moka tše le dum...
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...,khomishini ya ndinganyiso ya mbeu yo ewa maana...


### 4. Feature Engineering


First we will start with tokenization: converting a sentence into a list of words.

In [37]:
#applying tokenization to the data set
tokeniser = TreebankWordTokenizer()
df_train['tokens'] = df_train['text_no_url'].apply(tokeniser.tokenize)

In [38]:
#applying lemmatization
lemmatizer = WordNetLemmatizer()

In [39]:
#function that handles the process of lemmatization
def extract_lemma(words, lemmatizer):
    return ' '.join([lemmatizer.lemmatize(word) for word in words])

In [40]:
#calling extract_lemma function on the tokens column
df_train['lemma'] = df_train['tokens'].apply(extract_lemma, args=(lemmatizer, ))

In [41]:
#using countVectorizer
vectorizer = CountVectorizer(lowercase=True, stop_words='english', analyzer='word', ngram_range=(1, 1))

In [113]:
X_count = vectorizer.fit_transform(df_train['lemma'])

In [108]:
X_count

<33000x142647 sparse matrix of type '<class 'numpy.int64'>'
	with 873956 stored elements in Compressed Sparse Row format>
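
The bag-of-words step can be illustrated on a toy corpus. The sketch below uses the same `CountVectorizer` settings as the notebook; the three documents and the names `toy_vectorizer` and `toy_counts` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus made up for illustration (not rows from the dataset)
docs = [
    "winste op buitelandse valuta",
    "the province of kwazulu natal",
    "winste op valuta",
]

# Same settings as the notebook's vectorizer
toy_vectorizer = CountVectorizer(lowercase=True, stop_words='english',
                                 analyzer='word', ngram_range=(1, 1))
toy_counts = toy_vectorizer.fit_transform(docs)

# One row per document, one column per distinct word that was kept
print(toy_counts.shape)
# The learned vocabulary, in column order
print(sorted(toy_vectorizer.vocabulary_))
```

Note that `stop_words='english'` only filters English words such as "the" and "of"; frequent words in the other languages pass through unchanged.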

In [109]:
# Extract features to help predict the label 
X = X_count

In [110]:
X_count.shape

(33000, 142647)

In [48]:
X

<33000x142647 sparse matrix of type '<class 'numpy.int64'>'
	with 873956 stored elements in Compressed Sparse Row format>

In [35]:
#pickling the vectorizer for deployment
#pickle.dump(vectorizer, open("vector.pkl", "wb"))

In [114]:
y_count = vectorizer.transform(df_train['lang_id'])

In [115]:
y_count

<33000x142647 sparse matrix of type '<class 'numpy.int64'>'
	with 6000 stored elements in Compressed Sparse Row format>

In [57]:
# Determine our label
y = df_train['lang_id']

In [116]:
# Split Data (into Training & Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40 , random_state=0)
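
With `test_size=0.40`, 13200 of the 33000 rows are held out, which matches the total support in the classification report further down. The sketch below shows the same split on toy data. The `stratify` argument is an addition for illustration and is not used in the notebook; it keeps the class proportions identical in both parts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, two made-up classes of 10 each
X_toy = [[i] for i in range(20)]
y_toy = ["afr"] * 10 + ["eng"] * 10

# test_size=0.40 as in the notebook; stratify is an assumption added here
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=0, stratify=y_toy
)

# Number of rows in each part
print(len(X_tr), len(X_te))
# Class counts in the held-out part
print(Counter(y_te))
```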

#### TEST DATA PREPROCESSING

All the preprocessing steps which were applied on the training data will also be applied on the test data.

In [84]:
df_test.head(10)

index,text,test_no_url,text_clean,tokens,lemma
1,"Mmasepala, fa maemo a a kgethegileng a letlele...",Mmasepala fa maemo a a kgethegileng a letlelel...,mmasepala fa maemo kgethegileng letlelela kgat...,"[mmasepala, fa, maemo, kgethegileng, letlelela...",mmasepala fa maemo kgethegileng letlelela kgat...
2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,uzakwaziswa ngokufaneleko nakungafuneka eminye...,"[uzakwaziswa, ngokufaneleko, nakungafuneka, em...",uzakwaziswa ngokufaneleko nakungafuneka eminye...
3,Tshivhumbeo tshi fana na ngano dza vhathu.,Tshivhumbeo tshi fana na ngano dza vhathu,tshivhumbeo tshi fana na ngano dza vhathu,"[tshivhumbeo, tshi, fana, na, ngano, dza, vhathu]",tshivhumbeo tshi fana na ngano dza vhathu
4,Kube inja nelikati betingevakala kutsi titsini...,Kube inja nelikati betingevakala kutsi titsini...,kube inja nelikati betingevakala kutsi titsini...,"[kube, inja, nelikati, betingevakala, kutsi, t...",kube inja nelikati betingevakala kutsi titsini...
5,Winste op buitelandse valuta.,Winste op buitelandse valuta,winste op buitelandse valuta,"[winste, op, buitelandse, valuta]",winste op buitelandse valuta
6,"Ke feela dilense tše hlakilego, tša pono e tee...",Ke feela dilense tše hlakilego tša pono e tee ...,ke feela dilense tše hlakilego tša pono tee go...,"[ke, feela, dilense, tše, hlakilego, tša, pono...",ke feela dilense tše hlakilego tša pono tee go...
7,<fn>(762010101403 AM) 1495 Final Gems Birthing...,fn AM Final Gems Birthing OptionsZULUtxtfn,fn am final gems birthing optionszulutxtfn,"[fn, am, final, gems, birthing, optionszulutxtfn]",fn am final gem birthing optionszulutxtfn
8,Ntjhafatso ya konteraka ya mosebetsi: Etsa bon...,Ntjhafatso ya konteraka ya mosebetsi Etsa bonn...,ntjhafatso ya konteraka ya mosebetsi etsa bonn...,"[ntjhafatso, ya, konteraka, ya, mosebetsi, ets...",ntjhafatso ya konteraka ya mosebetsi etsa bonn...
9,u-GEMS uhlinzeka ngezinzuzo zemithi yezifo ezi...,uGEMS uhlinzeka ngezinzuzo zemithi yezifo ezin...,ugems uhlinzeka ngezinzuzo zemithi yezifo ezin...,"[ugems, uhlinzeka, ngezinzuzo, zemithi, yezifo...",ugems uhlinzeka ngezinzuzo zemithi yezifo ezin...
10,"So, on occasion, are statistics misused.",So on occasion are statistics misused,so on occasion are statistics misused,"[so, on, occasion, are, statistics, misused]",so on occasion are statistic misused


In [83]:
df_test.shape

(5682, 5)

In [65]:
#creating a test dataframe
df_test = pd.DataFrame(df_test[['text']])

In [67]:
#replacing all URLs with the text 'url-web'
df_test['test_no_url'] = df_test['text'].replace(to_replace = pattern_url, value = subs_url, regex = True)

In [69]:
#applying the remove_punct function to the test data
df_test['test_no_url'] = df_test['test_no_url'].apply(lambda x: remove_punct(x))
df_test.head()

index,text,test_no_url
1,"Mmasepala, fa maemo a a kgethegileng a letlele...",Mmasepala fa maemo a a kgethegileng a letlelel...
2,Uzakwaziswa ngokufaneleko nakungafuneka eminye...,Uzakwaziswa ngokufaneleko nakungafuneka eminye...
3,Tshivhumbeo tshi fana na ngano dza vhathu.,Tshivhumbeo tshi fana na ngano dza vhathu
4,Kube inja nelikati betingevakala kutsi titsini...,Kube inja nelikati betingevakala kutsi titsini...
5,Winste op buitelandse valuta.,Winste op buitelandse valuta


In [70]:
#apply the clean_data function
df_test['text_clean'] = df_test['test_no_url'].apply(clean_data)

In [71]:
#applying tokenizer
df_test['tokens'] = df_test['text_clean'].apply(tokeniser.tokenize)

In [72]:
#applying the extract_lemma function
df_test['lemma'] = df_test['tokens'].apply(extract_lemma, args=(lemmatizer, ))

In [117]:
#transforming the data using vectorizer
test_count = vectorizer.transform(df_test['lemma'].values.astype(str))

In [118]:
#selecting the feature
x_test = test_count.toarray()

In [119]:
#the shape of the feature
x_test.shape

(5682, 142647)
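
The test matrix has the same 142647 columns as the training matrix because `transform` reuses the vocabulary learned by `fit_transform`. A minimal sketch on made-up documents shows this; `train_docs`, `test_docs` and `vec` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["winste op valuta", "khomishini ya mbeu"]
test_docs = ["winste op onbekend"]  # "onbekend" never appears in train_docs

vec = CountVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vec.transform(test_docs)        # reuses it, learns nothing new

# Both matrices share the same number of columns
print(train_matrix.shape, test_matrix.shape)
```

Words that were not seen during fitting are ignored at transform time, so they contribute nothing to the prediction.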

### Model Development

#### Logistic Regression Model

In [53]:
lg_clf = LogisticRegression(random_state = 0)

In [120]:
lg_clf.fit(X_train, y_train)

In [121]:
y_pred= lg_clf.predict(X_test)

In [122]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00      1228
         eng       0.99      1.00      1.00      1159
         nbl       0.99      0.98      0.98      1258
         nso       1.00      1.00      1.00      1160
         sot       1.00      1.00      1.00      1188
         ssw       1.00      1.00      1.00      1219
         tsn       1.00      1.00      1.00      1171
         tso       1.00      1.00      1.00      1197
         ven       1.00      1.00      1.00      1200
         xho       0.99      0.99      0.99      1216
         zul       0.98      0.98      0.98      1204

    accuracy                           0.99     13200
   macro avg       0.99      0.99      0.99     13200
weighted avg       0.99      0.99      0.99     13200
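
The report shows slightly lower scores for nbl, xho and zul than for the other languages. One way to see which labels are mistaken for which is a confusion matrix, using `confusion_matrix` from the imports in section 1. The sketch below uses made-up labels, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

labels = ["nbl", "xho", "zul"]
# Made-up labels for illustration; not the notebook's predictions
y_true_toy = ["nbl", "nbl", "xho", "xho", "zul", "zul"]
y_pred_toy = ["nbl", "zul", "xho", "xho", "zul", "xho"]

# Rows are the true labels, columns the predicted labels, in the order of `labels`
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)
```

On the real split the call would be `confusion_matrix(y_test, y_pred)`, with off-diagonal cells counting the misclassified rows.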



In [76]:
#y_pred = lg_clf.predict(X_test)
f1_logistic = f1_score(y_test, y_pred, average = 'weighted')
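
With `average = 'weighted'`, the per-class F1 scores are averaged using each class's support as its weight. The sketch below works this out by hand on made-up labels; `weighted_f1` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), taken as 0 when the denominator is 0
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

# Made-up labels for illustration
y_true_toy = ["afr", "afr", "eng", "eng", "eng", "zul"]
y_pred_toy = ["afr", "eng", "eng", "eng", "eng", "zul"]
score = weighted_f1(y_true_toy, y_pred_toy)
print(round(score, 4))
```

Since the held-out classes in this notebook have similar support, the weighted and macro averages should come out close to each other.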

In [123]:
#making prediction
prediction = lg_clf.predict(x_test)

In [130]:
df_test2 = df_test

In [133]:
prediction

array(['eng', 'nbl', 'xho', ..., 'sot', 'sot', 'xho'], dtype=object)

In [137]:
df_test11['index']

0          1
1          2
2          3
3          4
4          5
        ... 
5677    5678
5678    5679
5679    5680
5680    5681
5681    5682
Name: index, Length: 5682, dtype: int64

In [143]:
#creating a dataframe for the submission
submission = pd.DataFrame(list(zip(df_test11['index'],prediction)), columns = ['index','lang_id'])
submission.head()

Unnamed: 0,index,lang_id
0,1,eng
1,2,nbl
2,3,xho
3,4,ssw
4,5,eng


In [144]:
#submission=submission.set_index('index')
submission.head()

Unnamed: 0,index,lang_id
0,1,eng
1,2,nbl
2,3,xho
3,4,ssw
4,5,eng


In [146]:
submission.shape

(5682, 2)

In [145]:
#saving the file as csv
submission.to_csv('submission_gideon.csv', index_label = False, index = False)
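
The submission file needs exactly two columns, `index` and `lang_id`. The sketch below builds such a frame from a dict, which is an alternative to `list(zip(...))` rather than the notebook's method, and checks the resulting CSV text. The ids and predictions are made up and stand in for `df_test11['index']` and `prediction`.

```python
import pandas as pd

# Made-up ids and predictions for illustration
ids = [1, 2, 3]
preds = ["eng", "nbl", "xho"]

# zip() silently truncates to the shorter input, so compare lengths first
assert len(ids) == len(preds)

# Building the frame from a dict names the columns directly
toy_submission = pd.DataFrame({"index": ids, "lang_id": preds})

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```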

#### Random Forest

In [147]:
f_clf = RandomForestClassifier(random_state=0)
f_clf.fit(X_train, y_train)

In [153]:
# Generate predictions for the unlabelled test set
y_forest = f_clf.predict(x_test)

In [154]:
#creating a dataframe for the submission
submission2 = pd.DataFrame(list(zip(df_test11['index'],y_forest)), columns = ['index','lang_id'])
submission2.head()

#saving the file as csv
submission2.to_csv('submission_RF.csv', index_label = False, index = False)

In [155]:
submission2.shape

(5682, 2)

#### KNN Model

In [158]:
n_neighbors = 3 # <--- change this number to play around with how many nearest neighbours to look for.

knn = KNeighborsClassifier(n_neighbors)
# Fit the model 
knn.fit(X_train, y_train)

In [160]:
# Generate predictions for the unlabelled test set
# predict returns class labels; predict_proba would return probabilities instead
y_hat = knn.predict(x_test)

In [None]:
#creating a dataframe for the submission
submission3 = pd.DataFrame(list(zip(df_test11['index'],y_hat)), columns = ['index','lang_id'])
submission3.head()

#saving the file as csv
submission3.to_csv('submission_KNN.csv', index_label = False, index = False)