# **Language Identification Hackathon**

South Africa is a multicultural society characterised by its rich linguistic diversity, with 11 official languages. We are building a machine learning model that determines the natural language a piece of text is written in, using texts in South African languages to train the model.

# 1. Importing of Packages

In [10]:
# Packages for data analysis
import pandas as pd
import numpy as np
import time

# Packages for visualizations
import seaborn as sns
import matplotlib.style as style

# Packages for preprocessing
import nltk
import string
import re
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

# Packages for training models
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
import xgboost as xgb

# Model Evaluation Packages
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from sklearn.metrics import make_scorer

import matplotlib.pyplot as plt
%matplotlib inline

# Style
sns.set(font_scale=1.5)
style.use('seaborn-pastel')
style.use('seaborn-poster')


In [11]:
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

# 2. Loading of Dataset

In [12]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/south-african-language-identification-hack-2022/sample_submission.csv
/kaggle/input/south-african-language-identification-hack-2022/test_set.csv
/kaggle/input/south-african-language-identification-hack-2022/train_set.csv


In [13]:
# importing the dataset (directory name taken from the file listing above)
train = pd.read_csv('../input/south-african-language-identification-hack-2022/train_set.csv')
test = pd.read_csv('../input/south-african-language-identification-hack-2022/test_set.csv')
sample_submission = pd.read_csv('../input/south-african-language-identification-hack-2022/sample_submission.csv')

In [None]:
print(train['text'].head(7))

In [None]:
test.head(7)

In [None]:
sample_submission.head()

## 2.1 General Overview of Dataset

In [None]:
train.lang_id.value_counts()

In [None]:
# Taking a general overview of both datasets
print('TRAINING DATA')
print('============='+('\n'))
print('Shape of the dataset: {}\n'.format(train.shape))
print('Total Number of unique texts: {}\n'.format(len(set(train['text']))))
print('Total Number of missing values:\n{}\n\n'.format(train.isnull().sum()))
print('TEST DATA')
print('========='+('\n'))
print('Shape of the dataset: {}\n'.format(test.shape))
print('Total Number of unique texts: {}\n'.format(len(set(test['text']))))
print('Total Number of missing values:\n{}\n'.format(test.isnull().sum()))


# 3. Data Preprocessing

In [None]:
def clean_text(text):
    """
    This function uses regular expressions to remove HTML tags and
    newlines, converts the text to lowercase, and collapses any extra
    white space. Punctuation and numbers are left in place.

    Input:
    text: original text
          datatype: string

    Output:
    text: modified text
           datatype: string
    """
    # replace HTML tags with a space
    text = re.sub('<.*?>', ' ', text)
    # removal of numbers (disabled)
    # text = re.sub(r'\d+', ' ', text)
    # replace newlines with a space
    text = re.sub("\n", " ", text)
    # convert to lowercase
    text = text.lower()
    # collapse repeated whitespace by splitting and re-joining the words
    text = ' '.join(text.split())
    return text
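
As a quick check of what the cleaning step does and does not change, here is a minimal sketch that repeats the same steps on a made-up sample string (the sample is an assumption for illustration and does not come from the dataset).

```python
import re


def clean_text(text):
    """Same steps as the notebook's clean_text, repeated so this runs alone."""
    text = re.sub('<.*?>', ' ', text)   # replace HTML tags with a space
    text = re.sub("\n", " ", text)      # replace newlines with a space
    text = text.lower()                 # convert to lowercase
    return ' '.join(text.split())       # collapse repeated whitespace


# Made-up input with a tag, extra spaces, a newline, punctuation and digits
sample = "<p>Some   MIXED Case,\ntext 2022!</p>"
cleaned = clean_text(sample)
print(cleaned)
```

The tag, the newline and the extra spaces disappear and the text is lowercased, while the comma, the exclamation mark and the digits are kept.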


In [None]:
# Application of the function to clean the texts
train['text'] = train['text'].apply(clean_text)
test['text'] = test['text'].apply(clean_text)


In [None]:
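# Replace the literal string '.txt' with ' text file'
# (regex=False so that '.' is not treated as a regex wildcard)
train["text"] = train["text"].str.replace(".txt", " text file", regex=False)
test["text"] = test["text"].str.replace(".txt", " text file", regex=False)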
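
Whether `'.txt'` is read as a literal string or as a regular expression changes the result, because `.` matches any character in a regex. The sketch below shows the difference on two made-up strings (both are assumptions for illustration only).

```python
import pandas as pd

# Made-up examples: one real '.txt' suffix and one token that merely ends in 'txt'
s = pd.Series(["see notes.txt", "backup_txt"])

# Literal match: only the exact characters '.txt' are replaced
literal = s.str.replace(".txt", " text file", regex=False)

# Regex match: '.' matches any character, so '_txt' is replaced as well
pattern = s.str.replace(".txt", " text file", regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Passing `regex` explicitly also avoids depending on the pandas version, since the default for this parameter has changed between releases.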


# 4. Feature Engineering

## 4.1 Splitting out X (independent) and Y (target/dependent) variables


In [None]:
X = train['text']
y = train['lang_id']


## 4.2 Splitting of Training and Validation Sets


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10)
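
The split above is random and unseeded, so the validation set differs between runs. One optional refinement, which is not what the notebook does, is to fix `random_state` and pass `stratify` so that every language keeps the same share in both parts. A minimal sketch on toy data (the texts and the label codes `aaa` and `bbb` are hypothetical):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 100 placeholder texts, two hypothetical label codes, 50 each
texts = [f"text {i}" for i in range(100)]
labels = ["aaa"] * 50 + ["bbb"] * 50

# stratify keeps class proportions; random_state makes the split repeatable
X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.10, stratify=labels, random_state=42)

print(Counter(y_va))  # class counts in the validation part
```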


# 5. Model Building

## 5.1 Setting up Classifiers for Model Training

In [None]:
"""
Note: Some classifiers are commented out because
they take a very long time to run.
"""
classifiers = [LinearSVC(random_state=42),
               # SVC(),
               # DecisionTreeClassifier(),
               # RandomForestClassifier(n_estimators=100, max_depth=2,
               #                      random_state=0, class_weight="balanced"),
               # MLPClassifier(alpha=1e-5,
               #              hidden_layer_sizes=(5, 2),
               #              random_state=42),
               LogisticRegression(random_state=42,
                                  multi_class='ovr',
                                  n_jobs=1,
                                  C=1e5,
                                  max_iter=4000),
               KNeighborsClassifier(n_neighbors=5),
               MultinomialNB(),
               ComplementNB(),
               SGDClassifier(loss='hinge',
                             penalty='l2',
                             alpha=1e-3,
                             random_state=42,
                             max_iter=5,
                             tol=None)
               # GradientBoostingClassifier()
               # xgb.XGBClassifier(learning_rate=0.1,
               #                  n_estimators=1000,
               #                  max_depth=5,
               #                  min_child_weight=1,
               #                  gamma=0,
               #                  subsample=0.8,
               #                  colsample_bytree=0.8,
               #                  nthread=4,
               #                  seed=27)
               ]


**Creating Function for Model Building**

In [None]:
def models_building(classifiers, X_train, y_train, X_val, y_val):
    """
    This function takes in a list of classifiers
    and both the train and validation sets,
    and returns a summary of F1 scores and
    execution time as a dataframe

    Input:
    classifiers: a list of classifiers to train
                 datatype: list
    X_train: independent variable for training
             datatype: series
    y_train: dependent variable for training
             datatype: series
    X_val: independent variable for validation
           datatype: series
    y_val: dependent variable for validation
           datatype: series

    Output:
    models_summary: F1 scores and execution time for all the classifiers
                    datatype: dataframe
    """

    models_summary = {}

    # Pipeline to vectorize the text with TF-IDF and then build the model
    for clf in classifiers:
        clf_text = Pipeline([('tfidf', TfidfVectorizer(min_df=1,
                                                       max_df=0.9,
                                                       ngram_range=(1, 2))),
                             ('clf', clf)])

        # Logging the Execution Time for each model
        start_time = time.time()
        clf_text.fit(X_train, y_train)
        predictions = clf_text.predict(X_val)
        run_time = time.time()-start_time

        # Output for each model
        models_summary[clf.__class__.__name__] = {
            'F1-Macro': metrics.f1_score(y_val,
                                         predictions,
                                         average='macro'),
            'F1-Accuracy': metrics.f1_score(y_val, predictions,
                                            average='micro'),
            'F1-Weighted': metrics.f1_score(y_val,
                                            predictions,
                                            average='weighted'),
            'Execution Time': run_time}

    return pd.DataFrame.from_dict(models_summary, orient='index')
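
To make the `ngram_range=(1, 2)` setting in the pipeline concrete, the sketch below lists the features a `TfidfVectorizer` extracts from two made-up sentences. It uses the default `min_df` and `max_df` rather than the notebook's values, so that no terms are filtered out of this tiny example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents, for illustration only
docs = ["the cat sat", "the dog sat"]

# Word unigrams and bigrams, as in the pipeline above
vec = TfidfVectorizer(ngram_range=(1, 2))
X_toy = vec.fit_transform(docs)

features = sorted(vec.vocabulary_)  # every learned unigram and bigram
print(features)
print(X_toy.shape)                  # (number of documents, number of features)
```

Each document becomes one row of a sparse matrix with one column per unigram or bigram, which is the input every classifier in the list receives.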


### 5.1.1 Execution of the Classifiers

In [None]:
classifiers_df = models_building(classifiers, X_train, y_train, X_val, y_val)
ordered_df = classifiers_df.sort_values('F1-Macro', ascending=False)
ordered_df
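
The `F1-Accuracy` column is computed with `average='micro'`. For single-label multiclass problems like this one, micro-averaged F1 equals plain accuracy, which the following sketch checks on made-up labels (the label values are assumptions for illustration).

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels: 5 of the 8 predictions are correct
y_true = ["a", "b", "c", "a", "b", "c", "a", "a"]
y_pred = ["a", "b", "a", "a", "c", "c", "a", "b"]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

print(micro, acc)  # identical values
print(macro)       # unweighted mean of the per-class F1 scores
```

Macro F1 weights every class equally, so it is the more sensitive of the two when a model does poorly on one language.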


### 5.1.2 Comparing Classification Methods

The best-performing model is Multinomial Naive Bayes, with an F1-Macro of 99.9% and an accuracy of 99.9%, closely followed by Complement Naive Bayes, Logistic Regression, the Linear Support Vector Classifier, the Support Vector Machine, and others.

We will proceed with the top two algorithms and apply hyperparameter tuning to see which comes out better, given both their scores and their execution time.

## 5.2 Hyperparameter Tuning on Most Performing Models

In [None]:
# Refining the train-test split for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.01)


### 5.2.1 Multinomial Naive Bayes

In [None]:
# Creating a pipeline for the gridsearch
param_grid = {'alpha': [0.1, 1, 5, 10]}  # setting parameter grid

tuned_mnb = Pipeline([('tfidf', TfidfVectorizer(min_df=2,
                                                max_df=0.9,
                                                ngram_range=(1, 2))),
                      ('mnb', GridSearchCV(MultinomialNB(),
                                           param_grid=param_grid,
                                           cv=5,
                                           n_jobs=-1,
                                           scoring='f1_weighted'))
                      ])

tuned_mnb.fit(X_train, y_train)  # Fitting the model

y_pred_mnb = tuned_mnb.predict(X_val)  # predicting the fit on validation set

print(classification_report(y_val, y_pred_mnb))
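
Because the grid search is a step inside the pipeline, the selected `alpha` can be read from that step after fitting. A minimal sketch on a made-up toy corpus (the texts, the labels and `cv=2` are assumptions chosen only to keep the example small):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus with two classes, four texts each
texts = ["red apple pie", "green apple tart", "red berry jam", "sweet apple juice",
         "fast blue car", "slow blue truck", "fast red bike", "blue car race"]
labels = ["food"] * 4 + ["vehicle"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("mnb", GridSearchCV(MultinomialNB(),
                         param_grid={"alpha": [0.1, 1, 5, 10]},
                         cv=2,
                         scoring="f1_weighted")),
])
pipe.fit(texts, labels)

search = pipe.named_steps["mnb"]   # the fitted GridSearchCV step
print(search.best_params_)         # selected alpha
print(search.best_score_)          # mean cross-validated score for that alpha
```

In this arrangement the vectorizer is fitted once on all the training texts before the grid search splits them into folds. Wrapping the whole pipeline in `GridSearchCV` instead, with a parameter name such as `mnb__alpha`, would refit the vectorizer inside each fold.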


## 5.3 Creating File for Submission

In [None]:
submission_df = pd.DataFrame(test['index'])
submission_df['lang_id'] = tuned_mnb.predict(test['text'])
submission_df.to_csv('submission_tuned_multinomial_NB.csv', index=False)


# 6. Conclusion

Several algorithms were tried, and the Multinomial Naive Bayes classifier performed best. It did very well on the training and validation datasets, with an accuracy of over 99% and an F1-Macro score of over 99%. When the fitted model was tested on the held-out, unseen dataset, it predicted the language classes with an F1 score of about 97%.

# 7. References

1. Hyperparameters and Model Validation (An overview of classification model hyperparameters, hyperparameter tuning, and model validation) - Explore Data Science Academy
2. https://scikit-learn.org/stable/modules/grid_search.html