
Language-detection-project

Performing language detection on several texts using machine learning algorithms.

The methodology followed in this project is represented in the following map:

[Figure: project methodology map]

1-Data understanding We started with data collection: using the Selenium Python library, we scraped data from Twitter, focusing on three languages: Darija, French, and English. We then explored our dataset to understand its specificities and characteristics. A rough sketch of this kind of scraper is shown below.
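The actual scraping code is part of the project; as a hedged illustration only (the search URL, CSS selector, and waiting logic below are assumptions, not the project's actual code), a Selenium-based tweet scraper can look like this:

```python
# Minimal Selenium scraping sketch; the search URL and CSS selector are
# illustrative assumptions, not the project's actual code.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://twitter.com/search?q=lang%3Afr&f=live")  # hypothetical query
time.sleep(5)  # crude wait for the dynamically loaded results

# Grab the text of every tweet currently rendered on the page
tweets = [el.text for el in
          driver.find_elements(By.CSS_SELECTOR, '[data-testid="tweetText"]')]
driver.quit()
print(len(tweets), "tweets collected")
```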

2-Data preprocessing This is one of the most important steps in any modeling problem. Data preprocessing plays a crucial role, since modeling techniques are not equipped to process unstructured data, especially in our case, where we're dealing with textual data. We'll see more details further in the notebook.
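The notebook holds the exact pipeline; as a minimal sketch of the kind of cleaning tweets typically need (the rules below are assumptions, not necessarily the project's exact ones):

```python
import re

def clean_tweet(text: str) -> str:
    """Illustrative cleaning step; the project's exact rules may differ."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)       # drop mentions and hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(clean_tweet("Check this out! https://t.co/xyz #NLP @user"))
# -> "check this out"
```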

3-Modeling After getting our data ready and compatible with machine learning algorithms' inputs, we're ready to build our model. The challenge here is that we have several types of algorithms, and we will have to choose which one performs best in our case.

4-Evaluation After building our models, we move to evaluating them using different techniques.

After successfully cleaning our dataset, we move on to building the matrix. How is it done?

Vectorization

To move on to the creation of machine learning models, we must first transform the text into a data matrix suitable for processing by ML algorithms, while minimizing the loss of information as much as possible. Each line in our dataset will become a row of the matrix, hence we speak of a vector representation. To determine the features (the indexes), we use the CountVectorizer. The CountVectorizer has many parameters controlling indexation; we have chosen to use character N-grams. Here's a schema of what we're going to do using n-grams:

[Figure: n-gram construction schema]

1-Unigrams

Vector representation of languages using unigrams

[Table: unigram vector representation per language]

Let's take an example to understand what's going on.

Here's how the CountVectorizer works:

[Figure: CountVectorizer unigram example]

PS: in this example we're referring to the unigram parameter; this is pretty clear, since the size of the constructed vector is equal to 32, which is the number of unique unigram features!
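In scikit-learn terms, this corresponds to a character-level CountVectorizer restricted to unigrams; a minimal sketch on a toy corpus (the notebook applies the same idea to the scraped tweets):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["salam khoya", "bonjour mon ami", "hello my friend"]  # toy stand-ins

# Character unigrams: every feature is a single character
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # the unique characters (the features)
print(X.toarray())                         # one count vector per text
```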

2-Bigrams

Top bigrams (>1%) for each language

[Table: top bigrams (>1%) per language]

Let's take an example to understand what's going on.

Here's how the CountVectorizer works:

[Figure: CountVectorizer bigram example]

PS: in this example we're referring to the bigram parameter; the size of the constructed vector is equal to 941, which is the number of unique bigram features!
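Switching to bigrams only changes the ngram_range parameter; sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["salam khoya", "bonjour mon ami", "hello my friend"]

# Character bigrams: every feature is a pair of consecutive characters
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(texts)

print(len(vectorizer.get_feature_names_out()))  # number of unique bigram features
```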

3-Top 1% Mixture of Uni-grams and Bi-grams

Top mixture features (>1%) for each language

[Tables: top mixture (>1%) features per language]

Let's take an example to understand what's going on.

Here's how the CountVectorizer works:

[Figure: CountVectorizer mixture example]

PS: in this example we're referring to the mixture parameter; the size of the constructed vector is equal to 471.
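One plausible way to build the top-1% mixture (an assumption about the exact procedure, which lives in the notebook) is to fit a unigram+bigram vectorizer, keep only the n-grams whose share of all occurrences exceeds 1%, and refit with that fixed vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["salam khoya", "bonjour mon ami", "hello my friend"]

# Mixture of character unigrams and bigrams
base = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = base.fit_transform(texts)

# Keep n-grams whose share of all n-gram occurrences exceeds 1%
freqs = np.asarray(X.sum(axis=0)).ravel() / X.sum()
keep = base.get_feature_names_out()[freqs > 0.01]

# Refit with the reduced, fixed vocabulary
top = CountVectorizer(analyzer="char", ngram_range=(1, 2), vocabulary=list(keep))
X_top = top.fit_transform(texts)
print(X_top.shape)
```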

4-Top 60 Mixture of Uni-grams and Bi-grams

Top 60 mixture features for each language

[Tables: top 60 mixture features per language]

Let's take an example to understand what's going on.

Here's how the CountVectorizer works:

[Figure: CountVectorizer top-60 mixture example]
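A plausible sketch of the top-60 variant (assuming a labels list aligned with the texts; the exact selection code is in the notebook): for each language, keep its 60 most frequent n-grams, then use the union as the vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["salam khoya", "bonjour mon ami", "hello my friend"]  # toy corpus
labels = ["darija", "french", "english"]                       # assumed labels

base = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = base.fit_transform(texts)
features = base.get_feature_names_out()

vocab = set()
for lang in set(labels):
    rows = [i for i, lab in enumerate(labels) if lab == lang]
    counts = np.asarray(X[rows].sum(axis=0)).ravel()
    vocab.update(features[np.argsort(counts)[::-1][:60]])  # top 60 for this language

top60 = CountVectorizer(analyzer="char", ngram_range=(1, 2),
                        vocabulary=sorted(vocab))
print(top60.fit_transform(texts).shape)  # matrix width = size of the union
```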

Models

For this problem, we used 3 classification models:

Multinomial Naive Bayes

K-Nearest Neighbors

Logistic Regression

Result: after applying k-fold cross-validation, we found that Logistic Regression using the Top 1% mixture features is the best model, because it was best able to distinguish between the languages.
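A minimal sketch of that comparison with scikit-learn's cross_val_score (the toy data and hyperparameters are assumptions; the notebook holds the actual experiment):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Tiny toy corpus, repeated so 3-fold CV has enough samples per class
texts = ["salam khoya", "wach labas", "bonjour mon ami", "merci beaucoup",
         "hello my friend", "thank you very much"] * 3
labels = ["darija", "darija", "french", "french", "english", "english"] * 3

X = CountVectorizer(analyzer="char", ngram_range=(1, 2)).fit_transform(texts)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, labels, cv=3)  # k-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```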
