<a href="https://colab.research.google.com/github/snekz/lev_lda/blob/master/dialect_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Dialect Identification  
This notebook loads a pretrained model for Arabic dialect classification. The model labels Arabic text as either Levantine or non-Levantine, where Levantine Arabic covers the varieties spoken in the Levant region of the Middle East. It was trained on 160k mixed-genre sentences, of which around 14.6% were Levantine Arabic.

### Source

This model was trained with the [deep models originally developed for the AOC dialect identification task](https://github.com/UBC-NLP/aoc_id), with modifications.

### Details
For more details, see section **4.1.2** "Filtering out Non-Target Varieties of Arabic" [in the paper](http://uu.diva-portal.org/smash/record.jsf?pid=diva2%3A1439483&dswid=-6519). 



In [1]:
import json
from keras.models import model_from_json # to load model
from keras.preprocessing.text import Tokenizer # tokenization
from keras.layers import Input # input layer
import numpy as np
import keras.backend as K # backend ops used to calculate the F1 score

Using TensorFlow backend.


In [None]:
# required files

bigru_model = 'bigru_binary_10_epochs.json'
model_weights = 'bigru_binary_10_epochs.h'
text_data = 'tweets_mixed.txt'

In [4]:
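# manual upload (Colab only)

from google.colab import files

# files.upload() returns a dict of {filename: file contents} and saves each
# file to the working directory; separate names keep the filename variables
# defined above from being overwritten
uploaded_model = files.upload()
uploaded_weights = files.upload()
uploaded_text = files.upload()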

Saving bigru_binary_10_epochs.json to bigru_binary_10_epochs (2).json


Saving bigru_binary_10_epochs.h to bigru_binary_10_epochs (2).h


Saving tweets_mixed.txt to tweets_mixed.txt


In [5]:
def get_f1(y_true, y_pred): # taken from old keras source code
  # counts are taken after rounding scores to 0 or 1
  true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
  possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
  predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
  # K.epsilon() guards against division by zero
  precision = true_positives / (predicted_positives + K.epsilon())
  recall = true_positives / (possible_positives + K.epsilon())
  # F1 is the harmonic mean of precision and recall
  f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
  return f1_val
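
As a rough illustration of what `get_f1` computes, the same arithmetic can be written with NumPy and checked by hand on a few values. This is only a sketch: `f1_numpy`, the `eps` default (standing in for `K.epsilon()`), and the example labels and scores are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def f1_numpy(y_true, y_pred, eps=1e-7):
    """NumPy counterpart of get_f1: round scores to 0/1, then compute F1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # a true positive needs both the label and the rounded score to be 1
    true_positives = np.sum(np.round(np.clip(y_true * y_pred, 0, 1)))
    possible_positives = np.sum(np.round(np.clip(y_true, 0, 1)))
    predicted_positives = np.sum(np.round(np.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + eps)
    recall = true_positives / (possible_positives + eps)
    return 2 * (precision * recall) / (precision + recall + eps)

# made-up labels and scores: 2 true positives, 3 predicted, 3 actual positives
y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.8, 0.2, 0.7]
score = f1_numpy(y_true, y_pred)
```

With these values precision and recall are both 2/3, so the score comes out close to 2/3 as well.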

In [6]:
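# alternative to data_helpers.LoadPRED without normalization

def loadPretrained(data):
    # read one sentence per line; the trailing newline is kept so that the
    # sentences can later be written back to a file unchanged
    sentences = []
    with open(data, 'r', encoding='utf-8') as df:
        for line in df:
            sentences.append(line)
    return sentences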


In [9]:
# load model

with open('bigru_binary_10_epochs.json', 'r') as json_file:
    architecture = json.load(json_file)
    model = model_from_json(json.dumps(architecture))

In [11]:
# load weights and compile the model

model.load_weights('bigru_binary_10_epochs.h')
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=[get_f1])

In [15]:
# import data to filter

sentences = loadPretrained('tweets_mixed.txt')

In [16]:
# prepare data for input

def tokenizeData(data):
    # whitespace tokenizer: only tabs and newlines are filtered out
    tokenizer = Tokenizer(filters='\t\n', split=" ", char_level=False)
    # build the word index from the input texts
    tokenizer.fit_on_texts(data)
    # convert each text to a sequence of word indices
    data = tokenizer.texts_to_sequences(data)

    return data

tok_sent = tokenizeData(sentences)

# right-pad every sequence with zeros up to the model input length of 100
for s in range(len(tok_sent)):
  for i in range(100-len(tok_sent[s])):
    tok_sent[s].append(0)
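
The padding loop above lengthens short sequences but leaves any sequence longer than 100 tokens as it is. A minimal sketch of one way to bring every sequence to exactly the input length is shown here. The helper name `pad_or_truncate` is hypothetical, and keeping the first 100 tokens of an over-long sentence is an assumption, not something the original notebook specifies.

```python
def pad_or_truncate(sequences, max_len=100, pad_value=0):
    """Return copies of the sequences, each exactly max_len items long."""
    fixed = []
    for seq in sequences:
        # keep at most the first max_len tokens (assumed truncation policy)
        clipped = list(seq[:max_len])
        # right-pad with pad_value, as the notebook's padding loop does
        clipped.extend([pad_value] * (max_len - len(clipped)))
        fixed.append(clipped)
    return fixed

# made-up index sequences: one short, one longer than max_len
example = pad_or_truncate([[5, 3, 8], list(range(1, 121))], max_len=100)
```

Both sequences in `example` end up with 100 entries, so a later reshape to `(1, 100)` would be well defined for each of them.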

In [17]:
# check for problems in input size

safe = True
for s in tok_sent:
  if len(s) != 100:
    print("Problem detected!")
    safe = False
if safe:
  print("No issues with input size.")

No issues with input size.


In [None]:
# classify sentences

# x is the decision threshold, a value between 0 and 1
# a sentence is written to the Levantine file when its score is >= x,
# so higher x values give stricter classification for Levantine texts

x = 0.5

with open('Filtered_LEV.txt', 'w', encoding='utf-8') as lev_file, \
     open('Filtered_NOT_LEV.txt', 'w', encoding='utf-8') as other_file:
  for i in range(len(tok_sent)):
    # predict one sentence at a time; the model expects input of shape (1, 100)
    prediction = model.predict(np.reshape(tok_sent[i:i+1], (1,100)))
    if float(prediction[0][0]) >= x:
      lev_file.write(sentences[i])
    else:
      other_file.write(sentences[i])

print("Done")
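
To see how the threshold `x` changes the split without running the model, the decision rule can be tried on a handful of scores. This is a sketch under stated assumptions: `split_by_threshold` is a hypothetical helper, and the sentences and scores are made-up stand-ins for the first output column of `model.predict`.

```python
import numpy as np

def split_by_threshold(sentences, scores, threshold=0.5):
    """Split sentences into (levantine, other) using score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    is_lev = scores >= threshold
    lev = [s for s, keep in zip(sentences, is_lev) if keep]
    other = [s for s, keep in zip(sentences, is_lev) if not keep]
    return lev, other

# made-up sentences and scores, for illustration only
demo_sentences = ["a\n", "b\n", "c\n"]
demo_scores = [0.91, 0.50, 0.12]

lev, other = split_by_threshold(demo_sentences, demo_scores, threshold=0.5)
strict_lev, strict_other = split_by_threshold(demo_sentences, demo_scores, threshold=0.9)
```

At a threshold of 0.5 the first two sentences count as Levantine; raising it to 0.9 keeps only the first, which is the stricter filtering described in the comments above.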