# Language Classification Problem

For this problem, let's first look at the data.

In [58]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import csv

In [59]:
english_data = pd.read_csv("english.csv")
turkish_data = pd.read_csv("turkish.csv")

In [60]:
english_data.head()

Unnamed: 0,Word
0,aah
1,aahed
2,aahing
3,aahs
4,aal


In [61]:
turkish_data.head()

Unnamed: 0,Word
0,etmek
1,olmak
2,otu
3,su
4,bilimi


Looking at the head of the data, I noticed some abbreviations in the English dataset, and checking the full data revealed many more. The English dataset contains suffixed forms such as 'ly' and 's'. The Turkish data does not contain verb suffixes such as "yor", "dı", or "tı", but it does contain "mak" on verbs. Based on these observations, I think we should first preprocess the data, for example by stemming and removing punctuation (there is not much of it, but it exists). One way to approach this kind of data is character n-grams. But before building a model, let's split the data for testing.

In [62]:
english_data['Language'] = 0
turkish_data['Language'] = 1

In [7]:
turkish_data[1155:1158]

Unnamed: 0,Word,Language
1155,...,1
1156,gömlek,1
1157,sebze,1


In [8]:
import string
import nltk

def remove_punct(text):
    # Keep every character that is not ASCII punctuation
    return "".join(char for char in text if char not in string.punctuation)

english_data['Word'] = english_data['Word'].apply(remove_punct)
turkish_data['Word'] = turkish_data['Word'].apply(remove_punct)

In [9]:
turkish_data[1155:1158]

Unnamed: 0,Word,Language
1155,,1
1156,gömlek,1
1157,sebze,1


In [10]:
turkish_data.replace("", float("NaN"), inplace=True)
turkish_data.dropna(inplace=True)
turkish_data[1155:1158]

Unnamed: 0,Word,Language
1156,gömlek,1
1157,sebze,1
1158,karga,1


In [11]:
ps=nltk.PorterStemmer()
def stemming(text):
    stemmed_text = [ps.stem(word) for word in text]
    return stemmed_text
english_data['stemmed_words']= english_data['Word'].apply(lambda x: stemming([x]))

In [12]:
english_data[71:75]

Unnamed: 0,Word,Language,stemmed_words
71,abalienate,0,[abalien]
72,abalienated,0,[abalien]
73,abalienating,0,[abalien]
74,abalienation,0,[abalien]


As we can see, stemming can change the actual word (all four words above collapse to "abalien"), so I will try both approaches: with stemming and without stemming. First, without stemming.

In [13]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_data = pd.concat([turkish_data, english_data], ignore_index=True)
all_data.head()

Unnamed: 0,Word,Language,stemmed_words
0,etmek,1,
1,olmak,1,
2,otu,1,
3,su,1,
4,bilimi,1,


In [14]:
# drop() returns a new DataFrame, so assign it back to keep the change
all_data = all_data.drop(columns=['stemmed_words'])
all_data

Unnamed: 0,Word,Language
0,etmek,1
1,olmak,1
2,otu,1
3,su,1
4,bilimi,1
...,...,...
422079,zwinglianism,0
422080,zwinglianist,0
422081,zwitter,0
422082,zwitterion,0


In [15]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(all_data['Word'],all_data['Language'], test_size=0.2, random_state=25)

In [16]:
print(X_train.shape)
print(X_test.shape)

(337667,)
(84417,)


# Logistic Regression Model Building

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(3, 3),analyzer='char')
vectorizer.fit(X_train)

CountVectorizer(analyzer='char', ngram_range=(3, 3))

In [19]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out()[8187])
print(vectorizer.transform(['bağırsağı']))

sağ
  (0, 706)	2
  (0, 760)	1
  (0, 7945)	1
  (0, 8187)	1
  (0, 12026)	1
  (0, 12201)	1
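
The sparse output above can be reproduced by hand. As a minimal sketch in plain Python (the helper `char_ngrams` is a hypothetical name used only for illustration, not part of scikit-learn), sliding a 3-character window over 'bağırsağı' gives seven trigrams, six of them distinct, with 'ağı' occurring twice. That matches the six stored entries and the single count of 2 shown above.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Return all overlapping character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Count the trigrams of the word used in the cell above
counts = Counter(char_ngrams("bağırsağı", 3))

for gram, count in counts.items():
    print(gram, count)
```

Each distinct trigram corresponds to one column index in the vectorizer's vocabulary, and the count is the value stored at that column.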


In [21]:
X_train = vectorizer.transform(X_train)
X_test  = vectorizer.transform(X_test)

In [22]:
X_train

<337667x12575 sparse matrix of type '<class 'numpy.int64'>'
	with 2442348 stored elements in Compressed Sparse Row format>

In [25]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [28]:
score = classifier.score(X_test, y_test)
print("Accuracy:", score)

Accuracy: 0.9741876636222562


In [33]:
X_turkish_train,X_turkish_test,y_turkish_train,y_turkish_test = train_test_split(turkish_data['Word'],turkish_data['Language'], test_size=0.2, random_state=26)

In [34]:
X_turkish_test=vectorizer.transform(X_turkish_test)
score = classifier.score(X_turkish_test, y_turkish_test)
print("Accuracy:", score)

Accuracy: 0.8600470588235294
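
Note that this Turkish-only split is drawn from the full Turkish data with a different seed, so most of its words were probably also seen during training. A cleaner check would be per-class recall on the original held-out test set. Below is a minimal sketch in plain Python; `per_class_recall` is a hypothetical helper and the labels are made-up toy values, not results from this notebook. With scikit-learn available, `sklearn.metrics.recall_score` or `classification_report` should give the same kind of breakdown.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Toy labels, made up for illustration (0 = English, 1 = Turkish)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

recalls = per_class_recall(y_true, y_pred)
print(recalls)
```

In this toy case overall accuracy is 0.8 even though only half of the minority class is recognized, which is the kind of gap a single accuracy number can hide.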


In [57]:
test_words = ['hovarda','haydaaa','hello','harika','helyum','helium']
for word in test_words:
    if(classifier.predict(vectorizer.transform([word]))<1):
        print("English")
    else:
        print("Turkish")

Turkish
Turkish
English
Turkish
Turkish
English


## Comments

As we can see, overall accuracy is 0.974, but because of the imbalance between the datasets (the Turkish dataset has about 53k words and the English dataset about 368k), accuracy on Turkish words is close to 0.86. We can try some processing to increase this accuracy, such as increasing the weight of Turkish words, resampling with a different ratio, or adding some rules. Of course, we could also build a multi-layer neural network such as an RNN or LSTM, so that it learns the weights itself and possibly gets better results on Turkish words. I used 3-grams because of the RAM limit on Jupyter Lab (2 GB), but 1-grams, 2-grams, or a combination of all three may give better results. We can try this in a better environment (more RAM) just by changing ngram_range.
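
To make the "increase the weight of Turkish words" idea concrete, here is a minimal sketch of the usual balanced weighting formula, n_samples / (n_classes * class_count). The helper name `balanced_class_weights` is hypothetical, and the counts are the rounded figures quoted above, not exact dataset sizes.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Rounded counts from the comment above (0 = English, 1 = Turkish)
approx_counts = {0: 368_000, 1: 53_000}

weights = balanced_class_weights(approx_counts)
print(weights)
```

A dictionary like this could be passed to `LogisticRegression(class_weight=...)`, or `class_weight='balanced'` could be used to let scikit-learn derive the weights from the training labels. Whether this actually improves accuracy on Turkish words here is untested.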

/////

From here on, I save my model and vectorizer so that I can load them later and use them in an API.

In [2]:
import pickle
model_path = "models/model.pickle"
vectorizer_path = "models/vectorizer.pickle"
# Use context managers so the files are closed after writing
with open(model_path, 'wb') as f:
    pickle.dump(classifier, f)
with open(vectorizer_path, 'wb') as f:
    pickle.dump(vectorizer, f)

NameError: name 'classifier' is not defined

In [9]:
import pickle
model_path = "models/model.pickle"
vectorizer_path = "models/vectorizer.pickle"

# Use context managers so the files are closed after reading
with open(vectorizer_path, 'rb') as f:
    vectorizer = pickle.load(f)
with open(model_path, 'rb') as f:
    classifier = pickle.load(f)
test_words = ['caz','blu','hell','karizmatik','ceku','müsamaha']
for word in test_words:
    if(classifier.predict(vectorizer.transform([word]))<1):
        print("English")
    else:
        print("Turkish")

Turkish
English
English
Turkish
English
Turkish


In [12]:
import requests
import json
def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)
response = requests.get("https://api.thedogapi.com/v1/breeds")

In [13]:
import flask
from flask import request, jsonify

app = flask.Flask(__name__)
app.config["DEBUG"] = False

# Create some test data for our catalog in the form of a list of dictionaries.
books = [
    {'id': 0,
     'title': 'A Fire Upon the Deep',
     'author': 'Vernor Vinge',
     'first_sentence': 'The coldsleep itself was dreamless.',
     'published': '1992'},
    {'id': 1,
     'title': 'The Ones Who Walk Away From Omelas',
     'author': 'Ursula K. Le Guin',
     'first_sentence': 'With a clamor of bells that set the swallows soaring, the Festival of Summer came to the city Omelas, bright-towered by the sea.',
     'published': '1973'},
    {'id': 2,
     'title': 'Dhalgren',
     'author': 'Samuel R. Delany',
     'first_sentence': 'to wound the autumnal city.',
     'published': '1975'}
]


@app.route('/', methods=['GET'])
def home():
    return '''<h1>dev.to/koraybarkin Web API development with Flask</h1><p>Congratulations, you have successfully developed your first Web API!</p>'''


# A route to return all of the available entries in our catalog.
@app.route('/api/v1/resources/books/all', methods=['GET'])
def api_all():
    return jsonify(books)

app.run()

ModuleNotFoundError: No module named 'flask'