<a href="https://colab.research.google.com/github/isegura/TextSimplification/blob/master/ComplexWordIdentification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Complex Word Identification

Our goal is to develop a machine learning based system to determine if a word is complex (difficult) or simple (easy). You can find more information about this task at https://sites.google.com/view/cwisharedtask2018/


We are going to use the dataset provided by the Complex Word Identification (CWI) shared task https://sites.google.com/view/cwisharedtask2018/datasets?authuser=0. This dataset contains a list of words, where each word is classifies as 0 (simple) or as 1 (complex). This tasks is defined for three different languages: English, Spanish and German. 


In this notebook, we will only work with a subset of the English dataset. This dataset consists of two files:
- **Wikipedia_Train1.tsv** containing our training instances.
- **Wikipedia_Dev1.tsv** containing the instances that we will use to evaluate our system.

In this notebook, we will learn the following issues:
- read the dataset
- represent each word using a set of features useful for the task. 
- train a SVM model 
- Use the SVM model to predict the classes for the test dataset (**Wikipedia_Dev1.tsv**)



First, we must the local folder of our google drive

In [1]:
from google.colab import drive
drive.mount("/content/drive/")

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive/


## Read the dataset

Please, read the file **README.md** (data/README.md), which explains the format of the dataset. 

Each line in **Wikipedia_Train1.tsv** provides information about a word. 
For example:


**3Z8UJEJOCZEG603II1EL4BE2PV593A	Syrian troops shelled a rebel-held town on Monday, sparking intense clashes that sent bloodied victims flooding into hospitals and clinics, activists said.	7	13	troops	10	10	0	0	0	0.0**

The fields for each word are:

*   An identifier for the text. (for instance 3Z8UJEJOCZEG603II1EL4BE2PV593A)
*   The text where the word occurs. (for instance *Syrian troops...., activists said.*)
*   The start and end positions (offsets) of the word in the text. (For instance: 7 13).
*   The word to be classified: *troops*.
*   The 6th columns shows the number of native annotators who reviewed the word.
*   The 7th columns shows the number of non-native annotators who reviewed the word.
*   The 8th columns shows the number of native annotators who classified the word as complex (with the label 1).
*   The 9th columns shows the number of non-native annotators who classified the word as complex (with the label 1).
*   The 10th columns is 1 if the word was classified as complex, and 0 if the word was classified as simple. 
*   The 11th columns provides the probability of the word to be classified as complex. 

In this approach, we are not going to exploit the 6th-9th columns. 

The dataset could be readed using different methods. I recommend you to use **csv** package. 

The following code shows how to read the dataset. 




In [11]:
import csv #package to work with tsv files
sst_home='drive/My Drive/Colab Notebooks/'
#Modify your folder
sst_home += 'TESI/6-TextSimplification/'
path=sst_home+'data/Wikipedia_Train1.tsv'

tsvin=open(path,'rb') 
tsvin = csv.reader(tsvin, delimiter='\t')

i=0
for row in tsvin:
    id=row[0] # id test
    sentence=row[1] #sentence text
    start=row[2] #start offset 
    end=row[3] #end offset + 1
    word=row[4] #word to be classified
    
    
    nat=row[5] #number of native annotators
    nonnat=row[6] #number of non-native annotators
    nat1=row[7] #number of native annotators who classified the words as 1
    nonnat1=row[8]#number of non-native annotators who classified the words as 1

    
    class_word=row[9] #class: 1 (complex) or 0 (simple)
    probability=row[10] #(total annotators who assigned 1/ total anotadores)

    #we show the 10 first words
    print('id',id)
    print('sentence',sentence)
    print('start',start)
    print('end',end)
    print('word',word)
    #print("anotadores",nat,nonnat,nat1,nonnat1)
    #print("clase y probabilidad", class_word,probability)
    print("clase:", class_word)
    
    print('\n')
    i+=1
    if i==10:
        break
    
    

('id', '3XU9MCX6VODXPI3L8I02CM94TFB2R7')
('sentence', "Normally , the land will be passed down to future generations in a way that recognizes the community 's traditional connection to that country .")
('start', '0')
('end', '8')
('word', 'Normally')
('clase:', '1')


('id', '3XU9MCX6VODXPI3L8I02CM94TFB2R7')
('sentence', "Normally , the land will be passed down to future generations in a way that recognizes the community 's traditional connection to that country .")
('start', '28')
('end', '34')
('word', 'passed')
('clase:', '1')


('id', '3XU9MCX6VODXPI3L8I02CM94TFB2R7')
('sentence', "Normally , the land will be passed down to future generations in a way that recognizes the community 's traditional connection to that country .")
('start', '15')
('end', '19')
('word', 'land')
('clase:', '0')


('id', '3XU9MCX6VODXPI3L8I02CM94TFB2R7')
('sentence', "Normally , the land will be passed down to future generations in a way that recognizes the community 's traditional connection to that count

## Feature set for word complex indentification

Now we are going to represent each word to be classified with a set of features, which should be useful for the task of CWI. So each word will be represented by a vector of features and its class (1 or 0). 

We will start with a small set of features, which have proved valuable for this task:
<ul>
<li>Length and number of syllables of a word. Complex words usually have a  longer length.
<li>Length of the sentence where the word occurs. 
A long sentence could indicate that the word is complex, but not always.</li>
</ul>

The following cell shows how to obtain these features for the first 50 words in the dataset:



In [12]:
#a function to count the number of syllables
syll = lambda w:len(''.join(c if c in"aeiouy"else' 'for c in w.rstrip('e')).split())

i=0
for row in tsvin:
    id=row[0] # id del párrafo donde ocurre la palabra
    sentence=row[1] #oración
    word=row[4] #palabra a clasificar

    len_word=len(word)
    num_syl=syll(word)
    len_sen=len(sentence)
    
    print(word,len_word,num_syl,len_sen)
    i+=1
    if i==50:
        break

('Aboriginal', 10, 4, 358)
('passing', 7, 2, 358)
('land', 4, 1, 358)
('legislaton', 10, 4, 358)
('rights', 6, 1, 358)
('preceded', 8, 3, 358)
('Australia', 9, 3, 358)
('Aboriginal', 10, 4, 358)
('number', 6, 2, 358)
('important', 9, 3, 358)
('Stockmen', 8, 2, 358)
('protests', 8, 2, 358)
('including', 9, 3, 358)
('Aboriginal', 10, 4, 358)
('Strike', 6, 1, 358)
('Yolngu', 6, 2, 358)
('Petition', 8, 3, 358)
('Bark', 4, 1, 358)
('Wave', 4, 1, 358)
('Walk-Off', 8, 1, 358)
('Hill', 4, 1, 358)
('Aboriginal', 10, 4, 358)
('Trust', 5, 1, 358)
('Lands', 5, 1, 358)
('established', 11, 4, 358)
('Act', 3, 0, 358)
('SA', 2, 0, 358)
('South', 5, 1, 358)
('Australian', 10, 3, 358)
('Aboriginal', 10, 4, 358)
('Lands', 5, 1, 358)
('Trust', 5, 1, 358)
('indigenous', 10, 4, 243)
('However', 7, 3, 243)
('Australians', 11, 3, 243)
('Australian', 10, 3, 243)
('Aborigines', 10, 4, 243)
('Torres', 6, 2, 243)
('Strait', 6, 1, 243)
('Islanders', 9, 2, 243)
('politically', 11, 5, 243)
('emerged', 7, 3, 243)
('a


Another useful feature could be the probability of the word w in a language model. The hypothesis is that complex words tend to be less frequent than simple words, which are usually more common.

We used the <a href='http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html'>The CMU-Cambridge Statistical Language Modeling Toolkit v2 </a> to create a language model, which was trained using the **Simple Wikipedia** corpus, with a total of 131,012 articles.  

Therefore, the model has been already generated and you can use it directly in this notebook. You can find it in the folder <a href='./LangModels/'>LangModels</a> (you can download from aulaglobal or from https://github.com/isegura/LanguageModel). 


So, for example:
- the file **wiki_1.wngram** contains a list of all possible unigrams that were found in the Simple Wikipedia corpus. For each unigram, its frequency in the corpus is also provided. 
- Similarly, **wiki_2.wngram** contains a list of all possible bigrams that were found in this corpus. For each bigram, its frequency in the corpus is also provided. 

First, we load these models into dictionaries. The ngrams will be the keys and the frequencies their corresponding values. 




In [13]:
def loadDic(path): 
    dic={}
    f=open(path,'r')
    for line in f:
        line=line.strip()
        pos=line.rfind(' ')
        key=line[0:pos-1]
        freq=line[pos+1]
        #print(key,freq)
        dic[key]=int(freq)
    f.close()
    return dic
print(sst_home)
path=sst_home+'LangModels/wiki_1.wngram'
unigrams=loadDic(path)
totalUnis=sum(unigrams.itervalues())
maxValue=max(unigrams.itervalues())

print('number of unigrams',len(unigrams.keys()),totalUnis,maxValue)

#Add the code to load the other models



drive/My Drive/Colab Notebooks/TESI/6-TextSimplification/
('number of unigrams', 105203, 244669, 9)


Once we have loaded the dictionary, we can already calculate the probability for each word in our dataset:

In [0]:
from __future__ import division

SCALED=10**4
def getProbability(ngram,dic,size):
    #pasamos a minúsculas y borramos  blancos de los extremos
    ngram=ngram.lower()
    
    ngram=ngram.strip()
    prob=0
    try:
        prob=dic[ngram]
        prob=prob/size
    except:
        #print(ngram, ' was not found!!!')
        pass

    prob=prob*SCALED
    return prob


Let us to show the features set (including already the probability) for the 50 first words:

In [21]:
i=0
for row in tsvin:
    id=row[0] # id del párrafo donde ocurre la palabra
    sentence=row[1] #oración
    word=row[4] #palabra a clasificar

    len_word=len(word)
    num_syl=syll(word)
    len_sen=len(sentence)
    probability=getProbability(word,unigrams,totalUnis)

    print(word,len_word,num_syl,len_sen,probability)
    i+=1
    if i==50:
        break

('major', 5, 2, 180, 0.326972358574237)
('key', 3, 1, 180, 0.040871544821779626)
('E-flat', 6, 1, 180, 0)
('major', 5, 2, 180, 0.326972358574237)
('Brahms', 6, 1, 86, 0)
('Johannes', 8, 3, 86, 0)
('Faur\xc3\xa9', 6, 1, 86, 0)
('C\xc3\xa9sar', 6, 1, 86, 0)
('Franck', 6, 1, 86, 0)
('Gabriel', 7, 2, 86, 0.040871544821779626)
('violin', 6, 2, 86, 0.24522926893067779)
('wrote', 5, 1, 86, 0.040871544821779626)
('sonatas', 7, 3, 86, 0)
('major', 5, 2, 86, 0.326972358574237)
('connection', 10, 3, 130, 0.326972358574237)
('Kreutzer', 8, 2, 130, 0)
('Beethoven', 9, 3, 130, 0)
('Cropper', 7, 2, 130, 0)
('Sonata', 6, 3, 130, 0.20435772410889813)
('Peter', 5, 2, 130, 0.24522926893067779)
('major', 5, 2, 130, 0.326972358574237)
('fullest', 7, 2, 130, 0)
('sounding', 8, 2, 130, 0)
('violin', 6, 2, 130, 0.24522926893067779)
('key', 3, 1, 130, 0.040871544821779626)
('Friedrich', 9, 2, 214, 0)
('According', 9, 2, 214, 0)
('Christian', 9, 2, 214, 0.1634861792871185)
('Schubart', 8, 2, 214, 0)
('Daniel', 

## Training a SVM model to classify complex words

Now we are going to train a SVM classifier to train a model using the training dataset. Then, we will apply this model on the test dataset to predict its classes and calculate the metrics precision, recall and Fmeasure using sklearn. 

The input of the SVM classifier will be the representation of the instances. An instance for each word to be classified. Therefore, the input can be represented as a matrix of n rows by m  columns, where n is the number of words (instances or lines in the training dataset) and m is the number of features plus the class . 

The following cell calculates the total number of instances (number of rows):


In [22]:
def getLines(path):
    f=open(path,'r') 
    numExamples = sum(1 for line in f)  # fileObject is your csv.reader
    f.close()
    return numExamples


path=sst_home+'data/Wikipedia_Train1.tsv'
print("Num. de ejemplos", getLines(path))


('Num. de ejemplos', 4830)


The following function first create an empty matrix (with dimension 4830 rows and 5 columns = 4 features plus the class). 

In [24]:
import numpy as np

def getMatrix(path):
    
    numExamples=getLines(path)
   
    #WArning!!!,if you add more feature, you should increase this value
    numFeatures=5 
    #create the empty matrix
    matrix = np.empty(shape=[numExamples, numFeatures]) 
    
    #open the file
    tsvin=open(path,'rb') 
    tsvin = csv.reader(tsvin, delimiter='\t')
    
    indexRow=0
    
   
    for row in tsvin:
        id=row[0] # id text
        sentence=row[1] #text
        word=row[4] #word
        class_word=row[9] #class: 1 o 0


        len_word=len(word)
        num_syl=syll(word)
        len_sen=len(sentence)
        prob=getProbability(word,unigrams,totalUnis)
        
        #create a vector with dimension numFeatures
        vector_fet= np.arange(numFeatures)
        
        #we add the features: 

        vector_fet[0]=len_word
        vector_fet[1]=num_syl
        vector_fet[2]=len_sen
        vector_fet[3]=prob
        #el último la clase
        vector_fet[4]=class_word
        
        #por último, reemplazamos el vector para el ejemplo con indexRow prob_2
        matrix[indexRow]=vector_fet
        
        #incrementamos en 1 para poder indicar el índice del siguiente ejemplo
        indexRow+=1
        
        

        
        
    
    return matrix


path=sst_home+'data/Wikipedia_Train1.tsv'
matrix_train=getMatrix(path)
print(matrix_train)

    

[[  8.   3. 144.   0.   1.]
 [  6.   2. 144.   0.   1.]
 [  4.   1. 144.   0.   0.]
 ...
 [ 10.   4. 189.   0.   0.]
 [  7.   2. 189.   0.   0.]
 [  5.   2. 189.   0.   0.]]


We should also create the matrix for the test dataset

In [26]:
path=sst_home+'data/Wikipedia_Dev1.tsv'
matrix_test=getMatrix(path)
print(matrix_test)


[[  4.   1. 160.   0.   1.]
 [ 13.   5. 160.   0.   1.]
 [  4.   1. 160.   0.   1.]
 ...
 [  4.   1. 133.   0.   0.]
 [  9.   3. 133.   0.   1.]
 [  4.   1. 133.   0.   1.]]


### Training and testing the model

We can already train the model. In particular, we will use the SVM classifier (we could use any binary classifier). SKlearn provides several implementations for SVM. We will use this: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html. 

The first column of the matrices represent the outputs (classes) for each instance. 


In [0]:
numCol=matrix_dev.shape[1]

X_train=matrix_train[:,0:numCol-1]
y_train=matrix_train[:, -1] #last column

numCol=matrix_dev.shape[1]

X_dev=matrix_test[:,0:numCol-1]
y_dev=matrix_test[:, -1] #last column

Finally, we train our model and use it to predict the outputs for the test dataset:

In [32]:
from sklearn import svm

#nos ayudará a obtener muy fácilmente las métricas precisión, recall y f1
from sklearn.metrics import precision_recall_fscore_support as pr
from sklearn.metrics import classification_report


#we could add other classifiers.... 
classifiers = [
    svm.SVC()
]

#en nuestro caso sólo se ejecutará una vez, porque sólo tnemos un algoritmo
for item in classifiers:
    print(item)
    clf = item
    #entrenamos
    clf.fit(X_train, y_train)
    #predicimos
    predictions=clf.predict(X_dev)
    print('\n\n')

    print(classification_report(y_dev, predictions))

    #obtenemos precisión, recall y f1 comparando el gold standard (y_dev) con las predicciones
    #bPrecis, bRecall, bFscore, bSupport = pr(y_dev, predicted, average='binary')
    #mostramos resultados
    #print(bPrecis, bRecall, bFscore, bSupport)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)



              precision    recall  f1-score   support

         0.0       0.67      0.73      0.70       340
         1.0       0.61      0.55      0.58       265

   micro avg       0.65      0.65      0.65       605
   macro avg       0.64      0.64      0.64       605
weighted avg       0.65      0.65      0.65       605



### G1: a metric for text simplification

In  CWI 2018 Shared task, the teams also used an additional metric to compare their sytems. 

Please, look this metric and implement it for our previous classifier. 
