<a href="https://colab.research.google.com/github/lightsamurai/API/blob/master/CL_aphasia_Copy1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# APHASIC patients' linguistic production DIAGNOSIS with WORD2VEC

The aim of this work is to rank the severity of aphasia by applying Word2Vec to our corpus, which was collected from the Aphasia Bank online database (https://aphasia.talkbank.org/). The database contains transcriptions of aphasic patients' speech, produced while carrying out specific, standardized tasks of the Aphasia Bank Protocol (e.g. describing the scene portrayed in a picture that the interviewer shows the patient). Word2Vec uses a neural network model to learn word associations from a large corpus of text.

Our input data consists of two lists of word pairs collected from the Aphasia Bank online database. The first list contains ~100 English target/response word pairs, where the target is the word that describes the scene (e.g. "ball") and the response is the word the patient produces (e.g. "sphere"). The second list contains ~30 Italian target/response word pairs.

The semantic model we use to get vector representations of our English target/response words is the pre-trained 'word2vec-google-news-300' set of vectors. To compute cosine similarity we use the built-in wv.similarity function, which takes a word pair as input and returns its cosine similarity. To do the same for our Italian word pairs, we trained a model with the skip-gram algorithm of Word2Vec on 10 million words from Wikipedia, retrieved with plainstream.
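As a rough illustration of what a cosine similarity call computes, the sketch below implements the formula (dot product divided by the product of the vector norms) in plain Python. The function name `cosine_similarity` and the three-dimensional toy vectors are assumptions made for this example; they are not real Word2Vec embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
target_vec = [1.0, 2.0, 0.0]
response_vec = [2.0, 4.0, 0.0]    # points in the same direction as target_vec
unrelated_vec = [0.0, 0.0, 3.0]   # orthogonal to target_vec

print(cosine_similarity(target_vec, response_vec))   # close to 1: same direction
print(cosine_similarity(target_vec, unrelated_vec))  # 0: no shared direction
```

Values near 1 mean the two vectors point in almost the same direction, values near 0 mean they are unrelated, and negative values are possible when they point in opposite directions.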

We then look at how similar the target/response words in our lists are, to get a severity ranking for the aphasic patients. We also look up the "most similar" words for the most common words in our data (e.g. "ball") and check whether any of our response words appear among them according to our models.
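One possible way to turn pair similarities into a per-patient ranking is sketched below. Using the mean similarity per patient as a severity proxy is an assumption of this example, not a method stated in the notebook, and the helper name `rank_patients` is hypothetical. The rows are a small subset of the patient/target/response triples and similarity values printed further down in this notebook.

```python
from collections import defaultdict

# (patient_id, target, response, cosine_similarity)
# Values copied from this notebook's English output for these pairs
scored = [
    ("1", "ball", "bound", 0.0339),
    ("1", "window", "womb", 0.1947),
    ("2", "boy", "man", 0.6825),
    ("2", "mother", "wife", 0.7551),
    ("3", "lamp", "lap", 0.1483),
    ("3", "umbrella", "ball", 0.0570),
]

def rank_patients(rows):
    """Return (patient_id, mean similarity), sorted from lowest to highest mean."""
    sims = defaultdict(list)
    for patient_id, _target, _response, sim in rows:
        sims[patient_id].append(sim)
    means = {pid: sum(vals) / len(vals) for pid, vals in sims.items()}
    return sorted(means.items(), key=lambda item: item[1])

for patient_id, mean_sim in rank_patients(scored):
    print(patient_id, round(mean_sim, 4))
```

Under this assumption, patients at the top of the list produced responses that are, on average, semantically furthest from their targets.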

Further steps (in progress):

- Give each word in our target/response lists a difficulty ranking, expressed as a float and based on the length of the word. This kind of length-based difficulty estimate is normally used by speech therapists and in medical contexts dealing with language impairment. We will then compute the correlation between the target/response cosine similarity and the difficulty ranking of the target word. This gives a "reweighting" of how well the patient accomplished the task that also takes into account the difficulty of retrieving the target word.

- Check whether each target/response pair is present in the MEN dataset (in either order) and, if so, compare the cosine similarity our model assigns to the pair with the similarity judged by human annotators.
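The first of these steps could be sketched as follows. This is a minimal example under stated assumptions: difficulty is taken to be word length divided by the length of the longest target word, the correlation is a plain Pearson coefficient, and the helper names `length_difficulty` and `pearson` are hypothetical. The similarity values are copied from this notebook's English output.

```python
import math

def length_difficulty(words):
    """Map each word to len(word) / length of the longest word (a float in (0, 1])."""
    longest = max(len(w) for w in words)
    return {w: len(w) / longest for w in words}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# (target, response, cosine similarity) taken from the English results
pairs = [
    ("ball", "bound", 0.0339),
    ("window", "womb", 0.1947),
    ("umbrella", "comb", 0.0946),
    ("mother", "wife", 0.7551),
    ("somewhere", "someone", 0.4419),
]

difficulty = length_difficulty([target for target, _, _ in pairs])
xs = [difficulty[target] for target, _, _ in pairs]
ys = [sim for _, _, sim in pairs]

print(difficulty)
print(pearson(xs, ys))
```

With only a handful of pairs the coefficient is not meaningful; the point is only to show the shape of the computation.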

In [None]:
# First, import the English input data and take a look at the list of word pairs
# The pairs are stored as a .csv file, i.e. comma-separated values

import pandas as pd
import numpy

# Create a DataFrame from the csv file
# Note: no header option is passed, so the first row ("ball", "bound") is used as the column names
df = pd.read_csv("/Users/silviafabbi/Desktop/input_pairs_EN.csv", delimiter=',', engine='python')

# Use a list comprehension to create a list of lists from the DataFrame rows
list_of_rows = [list(row) for row in df.values]

print(df)
type(df)

      ball    bound
0     ball     bask
1   window     womb
2   broken   bottom
3     word    money
4     rain      run
..     ...      ...
91     age     days
92     say      see
93    foot     head
94   woman      man
95     put     foot

[96 rows x 2 columns]


pandas.core.frame.DataFrame

In [None]:
def tupleize(file):
    output = []  # list that will hold one tuple per line
    # Open the file in read-only mode
    with open(file, "r") as infile:
        # .readline() reads the data one line at a time, using the "\n" character as the line end
        line = infile.readline()
        # The loop runs as long as "line" contains something, i.e. as long as a new line was read
        while line:
            # .rstrip() removes the trailing "\n"; .split(", ") cuts the line at every ", "
            # The resulting fields are stored as a tuple and appended to "output"
            nuple = tuple(line.rstrip().split(", "))
            output.append(nuple)
            line = infile.readline()
        # At the end of the file .readline() returns an empty string,
        # so the "while" condition is no longer satisfied and the loop stops

    return output


In [None]:
# We apply our function to our dataset to transform it into a list of tuples

input_EN = tupleize("/Users/silviafabbi/Desktop/input_pairs_EN.csv")

print(input_EN)

[('ball', 'bound'), ('ball', 'bask'), ('window', 'womb'), ('broken', 'bottom'), ('word', 'money'), ('rain', 'run'), ('dog', 'guy'), ('bark', 'biting'), ('cat', 'god'), ('girl', 'guy'), ('tree', 'train'), ('ball', 'bald'), ('ball', 'barn'), ('ball', 'banks'), ('king', 'cleans'), ('king', 'kins'), ('glass', 'grass'), ('woman', 'man'), ('somewhere', 'someone'), ('wear', 'words'), ('shoe', 'scene'), ('boy', 'man'), ('mother', 'wife'), ('school', 'cool'), ('girl', 'earl'), ('slipper', 'sipper'), ('lamp', 'lap'), ('umbrella', 'ball'), ('short', 'sort'), ('boy', 'boil'), ('woman', 'man'), ('fourth', 'force'), ('door', 'window'), ('glass', 'gas'), ('soaking', 'stoking'), ('ladder', 'window'), ('kick', 'cook'), ('window', 'door'), ('umbrella', 'comb'), ('umbrella', 'bread'), ('catch', 'kitchen'), ('soccer', 'sock'), ('mother', 'daughter'), ('lamp', 'lights'), ('give', 'gay'), ('rescue', 'like'), ('dog', 'door'), ('bark', 'talk'), ('kick', 'cook'), ('show', 'go'), ('both', 'bow'), ('cat', 'hat')

In [None]:
# Write the target/response pairs back to a csv file, one pair per line
# The "with" block closes the file automatically
with open("input_pairs_X.csv", "w") as printout:
    for pair in input_EN:
        printout.write(", ".join(pair))
        printout.write("\n")

In [None]:
pat_EN = tupleize("/Users/silviafabbi/Desktop/pat_input_EN.csv")

print(pat_EN)

[('1', 'ball', 'bound'), ('1', 'ball', 'bask'), ('1', 'window', 'womb'), ('1', 'broken', 'bottom'), ('1', 'word', 'money'), ('1', 'rain', 'run'), ('1', 'dog', 'guy'), ('1', 'bark', 'biting'), ('1', 'cat', 'god'), ('1', 'girl', 'guy'), ('1', 'tree', 'train'), ('1', 'ball', 'bald'), ('1', 'ball', 'barn'), ('1', 'ball', 'banks'), ('1', 'king', 'cleans'), ('1', 'king', 'kins'), ('1', 'glass', 'grass'), ('1', 'woman', 'man'), ('1', 'somewhere', 'someone'), ('1', 'wear', 'words'), ('1', 'shoe', 'scene'), ('2', 'boy', 'man'), ('2', 'mother', 'wife'), ('2', 'school', 'cool'), ('2', 'girl', 'earl'), ('2', 'slipper', 'sipper'), ('3', 'lamp', 'lap'), ('3', 'umbrella', 'ball'), ('3', 'short', 'sort'), ('4', 'boy', 'boil'), ('4', 'woman', 'man'), ('5', 'fourth', 'force'), ('5', 'door', 'window'), ('6', 'glass', 'gas'), ('6', 'soaking', 'stoking'), ('6', 'ladder', 'window'), ('7', 'kick', 'cook'), ('7', 'window', 'door'), ('7', 'umbrella', 'comb'), ('7', 'umbrella', 'bread'), ('7', 'umbrella', 'read

In [None]:
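def get_sim(data):
    # The "with" block closes the file automatically, so no explicit close() is needed
    # ("output" only exists inside get_sim: calling output.close() outside it raises a NameError)
    with open("list_of_.csv", "w") as output:

        # Each row is (patient id, target, response); the patient id is not used here
        for w1, w2, w3 in data:
            line = '%r\t%r\t%.4f' % (w2, w3, wv.similarity(w2, w3))
            print(line)
            # Add "\n" so that every pair is written on its own line
            output.write(line + "\n")


get_sim(pat_EN)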

'ball'	'bound'	0.0339
'ball'	'bask'	0.0549
'window'	'womb'	0.1947
'broken'	'bottom'	0.1166
'word'	'money'	0.2119
'rain'	'run'	0.1114
'dog'	'guy'	0.2742
'bark'	'biting'	0.3404
'cat'	'god'	0.1099
'girl'	'guy'	0.3644
'tree'	'train'	0.1557
'ball'	'bald'	0.1118
'ball'	'barn'	0.0713
'ball'	'banks'	0.0107
'king'	'cleans'	0.0292
'king'	'kins'	0.1493
'glass'	'grass'	0.1117
'woman'	'man'	0.7664
'somewhere'	'someone'	0.4419
'wear'	'words'	0.0561
'shoe'	'scene'	0.0908
'boy'	'man'	0.6825
'mother'	'wife'	0.7551
'school'	'cool'	0.1263
'girl'	'earl'	0.2010
'slipper'	'sipper'	0.1738
'lamp'	'lap'	0.1483
'umbrella'	'ball'	0.0570
'short'	'sort'	0.1629
'boy'	'boil'	0.1139
'woman'	'man'	0.7664
'fourth'	'force'	0.1147
'door'	'window'	0.6213
'glass'	'gas'	0.0585
'soaking'	'stoking'	0.2246
'ladder'	'window'	0.3029
'kick'	'cook'	0.0898
'window'	'door'	0.6213
'umbrella'	'comb'	0.0946
'umbrella'	'bread'	0.0462
'umbrella'	'read'	-0.0125
'catch'	'kitchen'	0.0604
'ball'	'bell'	0.0863
'soccer'	'sock'	0.1546
'mother'	

NameError: name 'output' is not defined

In [None]:
# Now we can transform every word in our dataset into a vector using the KeyedVectors data from gensim

In [None]:
# MODEL 1 - EN - vectors from GoogleNews

# We use gensim to import a word2vec model pre-trained on Google News
# The pre-trained model is loaded as an object of type gensim.models.keyedvectors.Word2VecKeyedVectors
# using gensim's downloader, api.load()

# This type of pre-trained model cannot be refined with additional data: KeyedVectors holds only the
# word vectors, not the full trainable model, which saves RAM when dealing with a huge quantity of data
# The 'word2vec-google-news-300' vectors were trained on the Google News dataset (about 100 billion words)
# The model contains 300-dimensional vectors for 3 million words and phrases

from gensim.models import Word2Vec
import gensim.downloader as api

In [None]:
# I load my KeyedVectors based on GoogleNews
# This will be the vector representation of the ca. 100 English words we collected from the AphasiaBank corpus

wv = api.load('word2vec-google-news-300')
type(wv) # gensim.models.keyedvectors.Word2VecKeyedVectors

gensim.models.keyedvectors.Word2VecKeyedVectors

In [None]:
# We're going to compare the semantic similarity of the target/response word pairs
# using wv = api.load('word2vec-google-news-300') as our pretrained model
# input_EN = my list of tuples made by target/response word

for w1, w2 in input_EN:
    print('%r\t%r\t%.4f' % (w1, w2, wv.similarity(w1, w2)))
    
    
# The last line of the code means the following:
# "%r" and "%.4f" are placeholders that are replaced with the contents of the tuple to the right of "%"
# Here we have "%r" (the repr of a value, which is why the words are printed in quotes), a tab "\t",
# another "%r", another "\t", and finally "%.4f", a float with 4 digits after the decimal point
# Python substitutes the placeholders, in the specified order, with the values in the tuple before printing
# Between numbers, "%" is the modulo operator (it returns the remainder of a division);
# applied to a string, as here, it is the string-formatting operator

'ball'	'bound'	0.0339
'ball'	'bask'	0.0549
'window'	'womb'	0.1947
'broken'	'bottom'	0.1166
'word'	'money'	0.2119
'rain'	'run'	0.1114
'dog'	'guy'	0.2742
'bark'	'biting'	0.3404
'cat'	'god'	0.1099
'girl'	'guy'	0.3644
'tree'	'train'	0.1557
'ball'	'bald'	0.1118
'ball'	'barn'	0.0713
'ball'	'banks'	0.0107
'king'	'cleans'	0.0292
'king'	'kins'	0.1493
'glass'	'grass'	0.1117
'woman'	'man'	0.7664
'somewhere'	'someone'	0.4419
'wear'	'words'	0.0561
'shoe'	'scene'	0.0908
'boy'	'man'	0.6825
'mother'	'wife'	0.7551
'school'	'cool'	0.1263
'girl'	'earl'	0.2010
'slipper'	'sipper'	0.1738
'lamp'	'lap'	0.1483
'umbrella'	'ball'	0.0570
'short'	'sort'	0.1629
'boy'	'boil'	0.1139
'woman'	'man'	0.7664
'fourth'	'force'	0.1147
'door'	'window'	0.6213
'glass'	'gas'	0.0585
'soaking'	'stoking'	0.2246
'ladder'	'window'	0.3029
'kick'	'cook'	0.0898
'window'	'door'	0.6213
'umbrella'	'comb'	0.0946
'umbrella'	'bread'	0.0462
'catch'	'kitchen'	0.0604
'soccer'	'sock'	0.1546
'mother'	'daughter'	0.8706
'lamp'	'lights'	0.4994
'give'

In [None]:
wv.most_similar("ball", topn=20)

[('balls', 0.6992625594139099),
 ('upfield', 0.6896207928657532),
 ('downfield', 0.6390728950500488),
 ('dribbler', 0.6218727827072144),
 ('balll', 0.6199932098388672),
 ('dribble', 0.616877555847168),
 ('ball_squirted', 0.6110137701034546),
 ('leftfooted', 0.6020259857177734),
 ('puck', 0.5981724262237549),
 ('mishit', 0.5948782563209534),
 ('lofted', 0.5933606028556824),
 ('theball', 0.5924203395843506),
 ('bobbling', 0.5848650336265564),
 ('dinked', 0.5820186138153076),
 ('dribbles', 0.5811805725097656),
 ('beautifully_flighted', 0.5757741928100586),
 ('mistimes', 0.5747321844100952),
 ('onsides', 0.5730898380279541),
 ('perfectly_flighted', 0.5724466443061829),
 ('deadball', 0.5708563923835754)]

For "ball", our input pairs contain the following matches: {(ball, bound), (ball, bask), (ball, bald), (ball, barn), (ball, banks), (umbrella, ball)}. None of the response words shows up among the "most similar" words for the Google News vectors.
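The check described here can also be done programmatically. The sketch below hard-codes a truncated copy of the neighbour list printed above (scores rounded to four decimals) so that it runs without loading the model; the helper name `responses_among_neighbours` is hypothetical.

```python
# First few neighbours of "ball", copied from the most_similar output above
neighbours = [
    ("balls", 0.6993),
    ("upfield", 0.6896),
    ("downfield", 0.6391),
    ("dribbler", 0.6219),
    ("puck", 0.5982),
]

# Responses recorded for the target "ball" in the English pairs
responses = ["bound", "bask", "bald", "barn", "banks"]

def responses_among_neighbours(responses, neighbours):
    """Return the responses that also occur in the (word, score) neighbour list."""
    neighbour_words = {word for word, _score in neighbours}
    return [r for r in responses if r in neighbour_words]

print(responses_among_neighbours(responses, neighbours))
```

With the model loaded, the result of `wv.most_similar("ball", topn=20)` could be passed directly as `neighbours`, since it has the same list-of-(word, score) shape.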

In [None]:
# MODEL 2 - IT - vectors from a model from plainstream/wikipedia
# model ITA from Wikipedia - 10 mln words
# 
# After training a w2v model in ITA, we saved it as a .model file
# The model was trained using the Skip Gram algorithm of Word2Vec

In [None]:
import nltk
import plainstream
import gensim
from gensim.models import KeyedVectors

# We are introducing the time module: it has many uses, but here we are just using the .time() method
# to measure the execution time of a process
import time

In [None]:
some_wiki # <generator object get_text at 0x7fce21e0ac10>

In [None]:
s = time.time()
# Here we ask plainstream for a certain amount of words (10 million in this case)
# NB: plainstream.get_text() returns a generator, which is exhausted after one use
# A generator behaves like an iterator,
# i.e. it can be used in a for loop
some_wiki = plainstream.get_text("it", max_words=10000000, tokenize=True)
some_text = []

# we want to make sure that every word is lower case. Because some_wiki generates lists
# of lists of tokens (i.e: tokenized sentences) we need to nest a couple of for loops in order to 
# reach the strings that we want to manipulate

for tokens_list in some_wiki:
    temp = []
    for word in tokens_list:
        temp.append(word.lower())
    some_text.append(temp)
e = time.time()
print(e-s)

In [None]:
s = time.time()
# This is where we train the model. We are using a couple of parameters here, but the most
# relevant is "sg": sg=1 selects the skip-gram algorithm (sg=0 would select CBOW)
# Note: gensim >= 4.0 renamed the "size" parameter to "vector_size"
model_ita = gensim.models.word2vec.Word2Vec(sentences=some_text, size=300, min_count=4, sg=1)
e = time.time()
print(e-s)

In [None]:
# we save our model

model_ita.save("/Users/silviafabbi/Desktop/ord2vec_10mil_wiki.model")

In [None]:
# Reload the trained model
# The file was saved from a full Word2Vec model, so we load it with Word2Vec.load()

model_ita = gensim.models.Word2Vec.load("/Users/silviafabbi/Desktop/ord2vec_10mil_wiki.model")

In [None]:
coppie = tupleize("/Users/silviafabbi/Desktop/input_pairs_IT.csv")
coppie

[('abbaia', 'saluta'),
 ('scala', 'piede'),
 ('calciare', 'lanciare'),
 ('prendere', 'chiedere'),
 ('ombrello', 'acqua'),
 ('bagnato', 'dormire'),
 ('rincorrere', 'prendere'),
 ('abbaia', 'gioca'),
 ('vetro', 'vetro'),
 ('ombrello', 'pioggia'),
 ('palla', 'finestra'),
 ('rompere', 'venire'),
 ('prendere', 'mettere'),
 ('salvare', 'trascinare'),
 ('sorellastre', 'donne'),
 ('topi', 'scoiattoli'),
 ('fata', 'amante'),
 ('fata', 'aspetta'),
 ('fata', 'la'),
 ('piove', 'pioggia'),
 ('bloccare', 'restare'),
 ('figlie', 'sorelle'),
 ('scomparire', 'spegnere'),
 ('lanciare', 'fare'),
 ('uscire', 'partire'),
 ('palla', 'palo')]

In [None]:
for w1, w2 in coppie:
    print('%r\t%r\t%.4f' % (w1, w2, model_ita.wv.similarity(w1, w2)))
    
# If the response word matches the target word (like "vetro" vs "vetro"), the similarity is 1 (the maximum)

'abbaia'	'saluta'	0.7050
'scala'	'piede'	0.1823
'calciare'	'lanciare'	0.5785
'prendere'	'chiedere'	0.5782
'ombrello'	'acqua'	0.4904
'bagnato'	'dormire'	0.4128
'rincorrere'	'prendere'	0.4610
'abbaia'	'gioca'	0.4616
'vetro'	'vetro'	1.0000
'ombrello'	'pioggia'	0.4927
'palla'	'finestra'	0.4999
'rompere'	'venire'	0.4567
'prendere'	'mettere'	0.5360
'salvare'	'trascinare'	0.6874
'sorellastre'	'donne'	0.4786
'topi'	'scoiattoli'	0.6643
'fata'	'amante'	0.5440
'fata'	'aspetta'	0.5140
'fata'	'la'	0.2890
'piove'	'pioggia'	0.5240
'bloccare'	'restare'	0.5101
'figlie'	'sorelle'	0.8361
'scomparire'	'spegnere'	0.6290
'lanciare'	'fare'	0.4572
'uscire'	'partire'	0.3452
'palla'	'palo'	0.4779


In [None]:
model_ita.wv.most_similar("palla", topn=20)

[('pallina', 0.7703580260276794),
 ('racchetta', 0.7295913696289062),
 ('boccia', 0.7282717227935791),
 ('pedina', 0.7159955501556396),
 ('buca', 0.7057750225067139),
 ("all'indietro", 0.7046733498573303),
 ('virata', 0.7045900821685791),
 ('scopa', 0.7014609575271606),
 ("l'asta", 0.6888625621795654),
 ('fallo', 0.6849193572998047),
 ("l'avversario", 0.6848279237747192),
 ('battitore', 0.6810046434402466),
 ('tira', 0.6789371967315674),
 ('polso', 0.6747191548347473),
 ("sull'acqua", 0.6742557883262634),
 ("l'hammer", 0.6721363067626953),
 ('avversaria', 0.6710895299911499),
 ('patta', 0.6709756255149841),
 ('capriola', 0.6647751331329346),
 ('alzata', 0.6643388271331787)]

For "palla", our input pairs contain the following matches: {(palla, finestra), (palla, palo)}. None of the response words shows up among the "most similar" words according to our Wikipedia model.

In [None]:
# Let's now create the difficulty ranking for our Italian pairs

def main_it():
    scores = []
    max_len = 0  # named max_len to avoid shadowing the built-in max()

    with open("/Users/silviafabbi/aphasia/test/data/italian.csv", "r") as words:
        word = words.readline()
        while word:
            word = word.rstrip()  # rstrip removes the trailing newline
            difficulty = len(word)
            # Keep track of the maximum difficulty for this dataset
            if max_len < difficulty:
                max_len = difficulty

            scores.append((word, difficulty))
            word = words.readline()

    print("Max score is: " + str(max_len))

    # Write each word with its difficulty divided by the maximum, so scores lie between 0 and 1
    with open("/Users/silviafabbi/Desktop/difficulty_IT_x.csv", "w") as diff:
        for word, difficulty in scores:
            diff.write(", ".join([word, str(difficulty / max_len)]))
            diff.write("\n")

if __name__ == "__main__":
    main_it()
else:
    print("Not to be used as a module.")


Max score is: 31


In [None]:
# I want my difficulty list to become a dictionary (a mapping: key = word, value = difficulty)

def tupleize(file):
    output = []  # initialize a list
    with open(file, "r") as infile:  # "file" is a path, opened in read-only mode
        # Take the first line; as long as "line" contains something, apply rstrip to remove the "\n"
        # The cleaned line is split at every ", ", which returns a list with as many items
        # as there are fields separated by ", "
        # We do not want a list, we want a tuple, so the list is converted with tuple()
        # "nuple" is that tuple, and it is appended to the list of tuples
        # Finally the next line is read and the loop starts again
        # When infile.readline() returns an empty string, the "while" condition fails and the loop ends

        line = infile.readline()
        while line:
            nuple = tuple(line.rstrip().split(", "))
            output.append(nuple)
            line = infile.readline()

    return output

# The scores are read from the file as strings, so convert them back to floats
difficulty = {word: float(score) for word, score in tupleize("/Users/silviafabbi/Desktop/difficulty_IT_x.csv")}

In [None]:
# I want to correlate my SIMILARITY_CHART_IT (made of tuples = w1, w2, cos_sim)
# with the difficulty = DIFFICULTY_IT_x.CSV (made of tuples = w, difficulty_score)

# I have to give each w1, w2 pair a difficulty score
# Sketch of this in-progress step: the target word's score is attached to each pair

def diff_rank(pairs, difficulty, out_path):
    # Collect (target, response, cos_sim, difficulty) for every pair whose target has a score
    rows = []
    for w1, w2 in pairs:
        if w1 in difficulty:
            rows.append((w1, w2, model_ita.wv.similarity(w1, w2), float(difficulty[w1])))

    # The "with" block closes the file automatically
    with open(out_path, "w") as out:
        for w1, w2, sim, diff in rows:
            out.write(", ".join([w1, w2, "%.4f" % sim, "%.4f" % diff]))
            out.write("\n")

    return rows

ranked = diff_rank(coppie, difficulty, "/Users/silviafabbi/Desktop/mean_IT_x.csv")
