First we import all the Python libraries needed for the classification.

In [190]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Now we read our Excel file into a DataFrame. I have prepared a file with 60 sentences: 20 in Polish, 20 in English and 20 in Spanish. The task for the classifier is to learn to assign each sentence to the correct language.  
You can use your own dataset instead. Just name the Excel file "data.xlsx" and name its columns "name" (the text) and "category" (the label).

In [194]:
df = pd.read_excel("data.xlsx")
X = df['name']
y = df['category']
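
If you do not have an Excel file at hand, a DataFrame with the same two columns can be built in memory. The following is only a sketch: the three sentences and the names `df_example`, `X_example` and `y_example` are made up for illustration and are not part of the original dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real data.xlsx holds 60 sentences.
df_example = pd.DataFrame({
    "name": [
        "To jest zdanie po polsku.",
        "This is a sentence in English.",
        "Esta es una frase en español.",
    ],
    "category": ["polish", "english", "spanish"],
})

# Check that the two columns the notebook relies on are present.
missing = {"name", "category"} - set(df_example.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

X_example = df_example["name"]      # the text to classify
y_example = df_example["category"]  # the language label
print(y_example.value_counts())
```

The same column check can be run on a DataFrame loaded with `pd.read_excel` before continuing.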

We have to divide our dataset into a training set and a test set. With test_size=0.3, 30% of the sentences (18 of 60) are held out for testing.

In [180]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
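
With a dataset this small, a purely random split can leave the languages unevenly represented in the test set. One option, shown here only as a sketch on placeholder data, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both parts. The lists `sentences` and `labels` below are stand-ins, not the real data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the notebook's dataset:
# 60 items, 20 per language.
labels = ["polish"] * 20 + ["english"] * 20 + ["spanish"] * 20
sentences = [f"sentence {i}" for i in range(60)]

X_tr, X_te, y_tr, y_te = train_test_split(
    sentences,
    labels,
    test_size=0.3,
    random_state=42,
    stratify=labels,  # keep 1/3 of each language in both parts
)

test_counts = Counter(y_te)
print(test_counts)  # 6 test sentences per language
```

Note that adding `stratify` changes which sentences land in the test set, so the results further down would differ from the ones shown.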

To classify text we first have to vectorize it: each sentence becomes a vector with one entry per word in the training vocabulary, holding the number of times that word occurs in the sentence (0 if it does not occur).

In [181]:
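vectorizer = CountVectorizer()
# Learn the vocabulary from the training sentences only.
vectorizer.fit(X_train)
# Replace the raw sentences with their sparse word-count matrix.
X_train = vectorizer.transform(X_train)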
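
To see what the vectorizer produces, here is a small sketch on a made-up two-sentence corpus (the corpus and variable names are illustrative only). By default `CountVectorizer` stores word counts; passing `binary=True` gives a pure 0/1 presence indicator instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the dog chased the cat", "the cat slept"]

# Default behaviour: each cell holds how often the word occurs.
counts = CountVectorizer()
X_counts = counts.fit_transform(corpus)

# Vocabulary words ordered by their column index.
vocabulary = sorted(counts.vocabulary_, key=counts.vocabulary_.get)
print(vocabulary)          # ['cat', 'chased', 'dog', 'slept', 'the']
print(X_counts.toarray())  # [[1 1 1 0 2], [1 0 0 1 1]]

# binary=True: each cell is 1 if the word occurs at all, else 0.
binary = CountVectorizer(binary=True)
X_binary = binary.fit_transform(corpus)
print(X_binary.toarray())  # [[1 1 1 0 1], [1 0 0 1 1]]
```

The word "the" appears twice in the first sentence, so its count is 2 in the default mode and 1 in binary mode.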

Now we choose the classifier, logistic regression, and train it on the vectorized training sentences (X_train) and their labels (y_train).

In [182]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
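
The vectorizing and training steps can also be bundled into a single object. The following is a sketch of that alternative using scikit-learn's `make_pipeline`; it is not what the notebook does, and the six training sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data, two sentences per language.
train_sentences = [
    "To jest zdanie po polsku.",
    "Lubię czytać książki wieczorem.",
    "This is a sentence in English.",
    "I like reading books in the evening.",
    "Esta es una frase en español.",
    "Me gusta leer libros por la noche.",
]
train_labels = ["polish", "polish", "english", "english", "spanish", "spanish"]

# The pipeline vectorizes raw text and then feeds it to the classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_sentences, train_labels)

# Raw strings go straight in; no separate transform call is needed.
prediction = model.predict(["I like my dog"])
print(prediction[0])
```

The advantage of this design is that the same fitted vocabulary is applied automatically at prediction time, so training and test data cannot be vectorized inconsistently.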

Now we are ready to test our classifier on the test data. The test sentences must be transformed with the same fitted vectorizer before predicting.

In [183]:
X_test_v = vectorizer.transform(X_test)
y_predicted = clf.predict(X_test_v)
max_probability = clf.predict_proba(X_test_v).max(axis=1)
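
`predict_proba` returns one row per sentence and one column per class, with the columns in the order of `clf.classes_`. Taking `.max(axis=1)` keeps the highest probability in each row. The sketch below illustrates this with NumPy on a hypothetical probability matrix; the numbers are invented, not output of the classifier.

```python
import numpy as np

# Column order as in clf.classes_ (alphabetical for string labels).
classes = np.array(["english", "polish", "spanish"])

# Hypothetical predict_proba output for two sentences.
proba = np.array([
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
])

max_proba = proba.max(axis=1)             # highest probability per row
best_label = classes[proba.argmax(axis=1)]  # class of that column

print(max_proba)   # [0.7 0.5]
print(best_label)  # ['polish' 'english']
```

A maximum close to 1/3 means the model is barely more confident than guessing among three languages.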

We can collect all the results in one DataFrame that shows each sentence, its correct language, the predicted language and the probability of the prediction.

In [184]:
df_results = pd.DataFrame({
    'sentence': X_test,
    'correct category': y_test,
    'predicted category': y_predicted,
    'probability': max_probability,
})
df_results

index,sentence,correct category,predicted category,probability
0,To jest pierwsze zdanie po polsku.,polish,polish,0.40608
5,Trzeba zrobić pranie i ugotować obiad dla dzieci.,polish,polish,0.585305
36,Każdego dnia ćwiczę przez pół godziny rano i w...,polish,polish,0.447449
45,When will a coronavirus vaccine be ready and h...,english,polish,0.37534
13,I was told that the meaning is the same with t...,english,english,0.916531
54,"Barcelona es una ciudad española, capital de l...",spanish,spanish,0.856561
33,Codziennie po lekcjach gram w tenisa ze swoim ...,polish,polish,0.54233
48,Get a one month free trial.,english,polish,0.418091
12,I like my dog and I hate cats.,english,polish,0.404264
57,¡Bienvenido a la única web especializada en ac...,spanish,spanish,0.890319


We can also print the classification report (with precision, recall, F1-score and accuracy).

In [185]:
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

     english       1.00      0.50      0.67         6
      polish       0.73      1.00      0.84         8
     spanish       1.00      1.00      1.00         4

    accuracy                           0.83        18
   macro avg       0.91      0.83      0.84        18
weighted avg       0.88      0.83      0.82        18
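
The numbers in the report can be reproduced by hand. The sketch below rebuilds label lists whose counts are inferred from the report itself (an assumption: 3 of the 6 English sentences were predicted as Polish, everything else was correct) and recomputes precision, recall and F1 from the confusion matrix.

```python
from sklearn.metrics import confusion_matrix

labels = ["english", "polish", "spanish"]

# Counts inferred from the report's support, precision and recall.
y_true = ["english"] * 6 + ["polish"] * 8 + ["spanish"] * 4
y_pred = ["english"] * 3 + ["polish"] * 3 + ["polish"] * 8 + ["spanish"] * 4

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

scores = {}
for i, label in enumerate(labels):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()  # correct / all predicted as label
    recall = tp / cm[i, :].sum()     # correct / all truly label
    f1 = 2 * precision * recall / (precision + recall)
    scores[label] = (round(precision, 2), round(recall, 2), round(f1, 2))
    print(label, scores[label])

accuracy = cm.trace() / cm.sum()     # 15 correct out of 18
print("accuracy", round(accuracy, 2))
```

For Polish, 8 sentences were classified correctly but 11 were predicted as Polish, which gives the precision of 8/11 ≈ 0.73 seen in the report.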



Now it is time to enter our own sentence to classify.

In [186]:
sentence_to_classify = input('Enter a sentence to classify (in Polish, English or Spanish): ')

The vectorizer expects a list of sentences, so we wrap our sentence in a one-element list and vectorize it.

In [187]:
# transform() expects an iterable of documents, hence the one-element list.
list_to_predict = vectorizer.transform([sentence_to_classify])

And now we are ready to predict using our logistic regression classifier.

In [188]:
predicted = clf.predict(list_to_predict)[0]
predicted_probability = clf.predict_proba(list_to_predict)
best_predicted_probability = round(predicted_probability.max(axis=1)[0],4)
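
Besides the single best probability, it can be useful to look at the probability of every language. The sketch below wraps the steps in a hypothetical helper, `classify_sentence`, and trains on a few invented sentences so that it runs on its own; in the notebook you would pass the already fitted `vectorizer` and `clf` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data so that the example is self-contained.
train_sentences = [
    "To jest zdanie po polsku.",
    "Lubię czytać książki wieczorem.",
    "This is a sentence in English.",
    "I like reading books in the evening.",
    "Esta es una frase en español.",
    "Me gusta leer libros por la noche.",
]
train_labels = ["polish", "polish", "english", "english", "spanish", "spanish"]

vectorizer = CountVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(train_sentences), train_labels)


def classify_sentence(sentence, vectorizer, clf):
    """Return the most probable label and the probability of each class."""
    features = vectorizer.transform([sentence])
    probabilities = clf.predict_proba(features)[0]
    # Columns of predict_proba follow the order of clf.classes_.
    per_class = {
        label: float(p) for label, p in zip(clf.classes_, probabilities)
    }
    best = max(per_class, key=per_class.get)
    return best, per_class


best, per_class = classify_sentence("I like my dog", vectorizer, clf)
print(best, per_class)
```

The per-class probabilities always sum to 1, so two values that are close together indicate that the model is unsure between those languages.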

In [189]:
print("You have written the sentence: {}".format(sentence_to_classify))
print("We predict that this sentence is in {} with probability {}.".format(predicted, best_predicted_probability))

You have written the sentence: nic mi się nie chce dzisiaj robić
We predict that this sentence is in polish with probability 0.6481.
