ULMFiT model for the Italian language / creation of a parallel corpus

Abstract

Our goal is to fine-tune a language model for Italian and, from it, obtain an emotion classifier for text. In the end we are able to generate an unlimited number of sentences, with labels assigned by the emotion classifier.

Dataset

Two datasets have been generated from open resources:

  • The first was generated from the OpenSubtitles corpus, using in particular the Italian and English languages.
  • The second was generated from TED talks, again in Italian and English, by web-scraping the TED talk transcripts.

We then used an English emotion classifier, based on BERT, to classify the English sentences; the predicted labels were assigned to the corresponding Italian sentences to build the Italian dataset.

We also created a third dataset by merging the Sub and Ted datasets.
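
The sketch below illustrates the label-projection step. It assumes the parallel corpus is available as (English, Italian) sentence pairs and uses a Hugging Face text-classification pipeline as a stand-in for the BERT emotion classifier; the model id is a placeholder, not the classifier actually used.

from transformers import pipeline

# Placeholder model id: any English BERT-based emotion classifier will do.
emotion_clf = pipeline('text-classification', model='english-emotion-bert-placeholder')

def project_labels(pairs):
  """Label the English side of each pair, then copy the label to the Italian side."""
  italian_dataset = []
  for en, it in pairs:
    label = emotion_clf(en)[0]['label']   # predicted emotion for the English sentence
    italian_dataset.append((it, label))   # the Italian sentence inherits that label
  return italian_dataset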

ULMFiT

Here we load the classifiers trained with the ULMFiT technique in a previous work.

# Load the exported learners (fastai v1 API: load_learner(path, file))
from fastai.text import *   # fastai v1; provides load_learner

learn_sub = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Sub/', 'export.pkl')
learn_ted = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Ted2/', 'export-ted2.pkl')
learn_merged = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Merged/', 'export.pkl')

We use this function to preprocess the data, removing special characters that do not affect the sentiment analysis (e.g. periods, commas, ...).

import re

def preprocess(s):
  # Lowercase, then keep only letters (including accented ones), a few
  # symbols (< / > ! ? ♥ ♡), whitespace and emoji; everything else becomes a space.
  s = s.lower()
  s = re.sub(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]", ' ', s)
  # Collapse runs of whitespace into a single space.
  s = re.sub(r"\s+", ' ', s)
  # Squash characters repeated three or more times down to two.
  s = re.sub(r'(\w)\1{2,}', r'\1\1', s)
  # Strip leading/trailing whitespace.
  return s.strip()
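
For example, repeated letters are squashed down to two, punctuation other than ! and ? is dropped, and whitespace is normalized:

print(preprocess('Che bellooo!!! Andiamo al mare...'))
# -> 'che belloo!!! andiamo al mare'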

import pandas as pd

# Sentences to test
test_text = pd.DataFrame([
    'Questo è il giorno più felice della mia vita',
    'Penso di potermela cavare questa volta',
    'L\'altro giorno sono andato a mare',
    'Domani andro al mare, che bello',
    'Credo di amarti',
    'Sei una persona orribile',
    'Vorrei viaggiare, sarebbe bellissimo',
    'Non voglio più uscire di casa',
    'La giornata oggi è pessima, non mi va di uscire',
    'Oggi sono malinconico, non mi va di uscire',
    'Mi sento triste, mi manca mio figlio',
    "Mi manca mio figlio"
])

print('Labels:\t',learn_merged.data.classes,'\n')

for s in test_text[0]:
  print('\n\nFrase: ',s)
  res = learn_sub.predict(preprocess(s))
  print('\n\tSub model: \n\t\tPredicted label: ', res[0], '\tProbabilities: ', res[2])
  res = learn_ted.predict(preprocess(s))
  print('\n\tTed model: \n\t\tPredicted label: ', res[0], '\tProbabilities: ', res[2])
  res = learn_merged.predict(preprocess(s))
  print('\n\tMerged model: \n\t\tPredicted label: ', res[0], '\tProbabilities: ', res[2])
Labels:     ['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness'] 



Frase:  Questo è il giorno più felice della mia vita

    Sub model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0189, 0.2521, 0.0747, 0.4743, 0.0511, 0.1289])

    Ted model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0087, 0.1474, 0.1952, 0.4646, 0.0576, 0.1264])

    Merged model: 
        Predicted label:  joy     Probabilities:  tensor([0.0037, 0.4122, 0.0781, 0.3479, 0.0423, 0.1158])


Frase:  Penso di potermela cavare questa volta

    Sub model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0919, 0.1144, 0.3329, 0.3894, 0.0640, 0.0073])

    Ted model: 
        Predicted label:  neutral     Probabilities:  tensor([0.0156, 0.0770, 0.6636, 0.1331, 0.0602, 0.0506])

    Merged model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0298, 0.1212, 0.2287, 0.5359, 0.0721, 0.0123])


Frase:  L'altro giorno sono andato a mare

    Sub model: 
        Predicted label:  neutral     Probabilities:  tensor([0.0525, 0.1982, 0.3469, 0.1801, 0.1552, 0.0670])

    Ted model: 
        Predicted label:  neutral     Probabilities:  tensor([0.0101, 0.0346, 0.4815, 0.3753, 0.0555, 0.0431])

    Merged model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0227, 0.1902, 0.2247, 0.4212, 0.0987, 0.0424])


Frase:  Domani andro al mare, che bello

    Sub model: 
        Predicted label:  joy     Probabilities:  tensor([0.0323, 0.5608, 0.0793, 0.2537, 0.0263, 0.0477])

    Ted model: 
        Predicted label:  neutral     Probabilities:  tensor([0.0082, 0.0242, 0.5839, 0.3462, 0.0221, 0.0154])

    Merged model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0670, 0.2664, 0.0882, 0.4957, 0.0360, 0.0468])


Frase:  Credo di amarti

    Sub model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0212, 0.2248, 0.2798, 0.3959, 0.0726, 0.0056])

    Ted model: 
        Predicted label:  neutral     Probabilities:  tensor([0.0043, 0.0394, 0.5940, 0.3241, 0.0286, 0.0096])

    Merged model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0368, 0.2224, 0.1299, 0.5295, 0.0596, 0.0218])


Frase:  Sei una persona orribile

    Sub model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0307, 0.2498, 0.1810, 0.5204, 0.0094, 0.0087])

    Ted model: 
        Predicted label:  neutral     Probabilities:  tensor([0.0080, 0.0261, 0.5722, 0.3457, 0.0396, 0.0084])

    Merged model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0708, 0.2134, 0.2204, 0.4588, 0.0166, 0.0200])


Frase:  Vorrei viaggiare, sarebbe bellissimo

    Sub model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0091, 0.1212, 0.0607, 0.7423, 0.0539, 0.0127])

    Ted model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0031, 0.0408, 0.1390, 0.5835, 0.2169, 0.0168])

    Merged model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0037, 0.0731, 0.0312, 0.5713, 0.2942, 0.0265])


Frase:  Non voglio più uscire di casa

    Sub model: 
        Predicted label:  pessimism     Probabilities:  tensor([0.1073, 0.0567, 0.0913, 0.1468, 0.5388, 0.0590])

    Ted model: 
        Predicted label:  optimism     Probabilities:  tensor([0.0039, 0.0251, 0.3116, 0.3779, 0.2718, 0.0098])

    Merged model: 
        Predicted label:  pessimism     Probabilities:  tensor([0.1089, 0.0430, 0.0988, 0.2247, 0.4779, 0.0468])


Frase:  La giornata oggi è pessima, non mi va di uscire

    Sub model: 
        Predicted label:  pessimism     Probabilities:  tensor([0.1464, 0.1910, 0.1061, 0.1059, 0.2537, 0.1969])

    Ted model: 
        Predicted label:  pessimism     Probabilities:  tensor([0.0127, 0.1391, 0.1608, 0.0785, 0.4944, 0.1144])

    Merged model: 
        Predicted label:  joy     Probabilities:  tensor([0.1178, 0.2558, 0.2044, 0.2261, 0.1100, 0.0858])


Frase:  Oggi sono malinconico, non mi va di uscire

    Sub model: 
        Predicted label:  joy     Probabilities:  tensor([0.2673, 0.3016, 0.1067, 0.0770, 0.1379, 0.1095])

    Ted model: 
        Predicted label:  pessimism     Probabilities:  tensor([0.0246, 0.1843, 0.1614, 0.1001, 0.3683, 0.1613])

    Merged model: 
        Predicted label:  joy     Probabilities:  tensor([0.1887, 0.2757, 0.1958, 0.1567, 0.1086, 0.0745])


Frase:  Mi sento triste, mi manca mio figlio

    Sub model: 
        Predicted label:  sadness     Probabilities:  tensor([0.0336, 0.1779, 0.0621, 0.1184, 0.0442, 0.5639])

    Ted model: 
        Predicted label:  joy     Probabilities:  tensor([0.0064, 0.4620, 0.0547, 0.1508, 0.1481, 0.1780])

    Merged model: 
        Predicted label:  sadness     Probabilities:  tensor([0.0180, 0.3835, 0.0439, 0.0921, 0.0692, 0.3933])


Frase:  Mi manca mio figlio

    Sub model: 
        Predicted label:  joy     Probabilities:  tensor([0.0810, 0.2809, 0.1750, 0.1954, 0.0569, 0.2109])

    Ted model: 
        Predicted label:  joy     Probabilities:  tensor([0.0134, 0.4548, 0.0661, 0.2426, 0.0582, 0.1649])

    Merged model: 
        Predicted label:  sadness     Probabilities:  tensor([0.0365, 0.2914, 0.0654, 0.1081, 0.0856, 0.4130])

OpenSubtitles dataset

Here we show the prediction probabilities of the OpenSubtitles model for each of the previous test sentences, together with its confusion matrix.

print(learn_sub.data.classes,'\n')
for s in test_text[0]:
  res = learn_sub.predict(preprocess(s))
  print(s,res[2])
['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness'] 

Questo è il giorno più felice della mia vita tensor([0.0189, 0.2521, 0.0747, 0.4743, 0.0511, 0.1289])
Penso di potermela cavare questa volta tensor([0.0919, 0.1144, 0.3329, 0.3894, 0.0640, 0.0073])
L'altro giorno sono andato a mare tensor([0.0525, 0.1982, 0.3469, 0.1801, 0.1552, 0.0670])
Domani andro al mare, che bello tensor([0.0323, 0.5608, 0.0793, 0.2537, 0.0263, 0.0477])
Credo di amarti tensor([0.0212, 0.2248, 0.2798, 0.3959, 0.0726, 0.0056])
Sei una persona orribile tensor([0.0307, 0.2498, 0.1810, 0.5204, 0.0094, 0.0087])
Vorrei viaggiare, sarebbe bellissimo tensor([0.0091, 0.1212, 0.0607, 0.7423, 0.0539, 0.0127])
Non voglio più uscire di casa tensor([0.1073, 0.0567, 0.0913, 0.1468, 0.5388, 0.0590])
La giornata oggi è pessima, non mi va di uscire tensor([0.1464, 0.1910, 0.1061, 0.1059, 0.2537, 0.1969])
Oggi sono malinconico, non mi va di uscire tensor([0.2673, 0.3016, 0.1067, 0.0770, 0.1379, 0.1095])
Mi sento triste, mi manca mio figlio tensor([0.0336, 0.1779, 0.0621, 0.1184, 0.0442, 0.5639])

[Confusion matrix of the Sub model]
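
A confusion matrix like the one above can be produced with fastai v1's interpretation utilities. This is a minimal sketch: it assumes a learner whose DataBunch still holds the validation set, so it applies to the training-time learner rather than one restored with load_learner (which carries no data).

from fastai.train import ClassificationInterpretation

# Collect validation predictions and plot the confusion matrix.
interp = ClassificationInterpretation.from_learner(learn_sub)
interp.plot_confusion_matrix(figsize=(6, 6))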

TED dataset

Here we show the prediction probabilities of the TED talks model for each of the previous test sentences, together with its confusion matrix.

print(learn_ted.data.classes,'\n')
for s in test_text[0]:
  res = learn_ted.predict(preprocess(s))
  print(s, res[2])
['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness'] 

Questo è il giorno più felice della mia vita tensor([0.0087, 0.1474, 0.1952, 0.4646, 0.0576, 0.1264])
Penso di potermela cavare questa volta tensor([0.0156, 0.0770, 0.6636, 0.1331, 0.0602, 0.0506])
L'altro giorno sono andato a mare tensor([0.0101, 0.0346, 0.4815, 0.3753, 0.0555, 0.0431])
Domani andro al mare, che bello tensor([0.0082, 0.0242, 0.5839, 0.3462, 0.0221, 0.0154])
Credo di amarti tensor([0.0043, 0.0394, 0.5940, 0.3241, 0.0286, 0.0096])
Sei una persona orribile tensor([0.0080, 0.0261, 0.5722, 0.3457, 0.0396, 0.0084])
Vorrei viaggiare, sarebbe bellissimo tensor([0.0031, 0.0408, 0.1390, 0.5835, 0.2169, 0.0168])
Non voglio più uscire di casa tensor([0.0039, 0.0251, 0.3116, 0.3779, 0.2718, 0.0098])
La giornata oggi è pessima, non mi va di uscire tensor([0.0127, 0.1391, 0.1608, 0.0785, 0.4944, 0.1144])
Oggi sono malinconico, non mi va di uscire tensor([0.0246, 0.1843, 0.1614, 0.1001, 0.3683, 0.1613])
Mi sento triste, mi manca mio figlio tensor([0.0064, 0.4620, 0.0547, 0.1508, 0.1481, 0.1780])

[Confusion matrix of the Ted2 model]

Merged dataset

Here we show the prediction probabilities of the merged-dataset model for each of the previous test sentences, together with its confusion matrix.

print(learn_merged.data.classes,'\n')
for s in test_text[0]:
  res = learn_merged.predict(preprocess(s))
  print(res[2])
['anger', 'joy', 'neutral', 'optimism', 'pessimism', 'sadness'] 

tensor([0.0037, 0.4122, 0.0781, 0.3479, 0.0423, 0.1158])
tensor([0.0298, 0.1212, 0.2287, 0.5359, 0.0721, 0.0123])
tensor([0.0227, 0.1902, 0.2247, 0.4212, 0.0987, 0.0424])
tensor([0.0670, 0.2664, 0.0882, 0.4957, 0.0360, 0.0468])
tensor([0.0368, 0.2224, 0.1299, 0.5295, 0.0596, 0.0218])
tensor([0.0708, 0.2134, 0.2204, 0.4588, 0.0166, 0.0200])
tensor([0.0037, 0.0731, 0.0312, 0.5713, 0.2942, 0.0265])
tensor([0.1089, 0.0430, 0.0988, 0.2247, 0.4779, 0.0468])
tensor([0.1178, 0.2558, 0.2044, 0.2261, 0.1100, 0.0858])
tensor([0.1887, 0.2757, 0.1958, 0.1567, 0.1086, 0.0745])
tensor([0.0180, 0.3835, 0.0439, 0.0921, 0.0692, 0.3933])

[Confusion matrix of the Merged model]

Language models

Using the datasets above we trained language models to generate Italian text. We started from a first language model trained on the Italian Wikipedia and then fine-tuned it on our datasets, obtaining the final language models. In the generated text below, xxbos is fastai's special beginning-of-stream token, marking where the model starts a new sequence.

The accuracy of our models is in the range of 20%. To obtain a good model we need a very large corpus, which we have, but we need more time to process the Italian datasets. This is going to be our next work.
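
As a rough sketch, the ULMFiT fine-tuning recipe in fastai v1 looks like the following. This is illustrative, not the exact training script: df_train, df_valid and the 'it_wiki' pretrained-weight file names are placeholders.

from fastai.text import *   # fastai v1

# Build a language-model DataBunch from the Italian sentences
# (df_train / df_valid are placeholder DataFrames with one text column).
data_lm = TextLMDataBunch.from_df('.', train_df=df_train, valid_df=df_valid, text_cols=0)

# Start from AWD-LSTM weights pretrained on the Italian Wikipedia
# ('it_wiki' / 'it_wiki_vocab' are placeholder file names).
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False,
                               pretrained_fnames=['it_wiki', 'it_wiki_vocab'])

learn.fit_one_cycle(1, 1e-2)   # first train with the body frozen
learn.unfreeze()
learn.fit_one_cycle(3, 1e-3)   # then fine-tune the whole network
learn.export('Language_model.pkl')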

TED talks

lm_ted = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Ted2/', 'Language_model.pkl')
# String prefix used to seed each generated sentence
TEXT = 'Come'
# Number of words to generate per sentence
N_WORDS = 5
# Number of sentences to generate
N_SENTENCES = 7

print('\n\n'.join(lm_ted.predict(TEXT,N_WORDS) for _ in range(N_SENTENCES)))
Come terapie anni si minore di

Come mette anno detto zona ]

Come avanzate il ministro quattro ha

Come della loro 35 da vaccino

Come troppo alternative è un intendo

Come quello un nel molte l'

Come futuro l' fanteria . xxbos

Subtitles

lm_sub = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Sub/', 'Language-model-sub.pkl')
# String prefix used to seed each generated sentence
TEXT = 'come'
# Number of words to generate per sentence
N_WORDS = 5
# Number of sentences to generate
N_SENTENCES = 7

print('\n\n'.join(lm_sub.predict(TEXT,N_WORDS) for _ in range(N_SENTENCES)))
come la fondo di meta con

come regolamento che la sapevi siete

come e da portarla parecchio per

come via un sto kit xxbos

come sono dolore che dopo mio

come per e settimana a via

come molto ? xxbos contratto xxbos

Merged

lm_merged = load_learner('/content/drive/My Drive/Colab Notebooks (1)/ULMFiT Merged/', 'Language-model-merged.pkl')
# String prefix used to seed each generated sentence
TEXT = 'come'
# Number of words to generate per sentence
N_WORDS = 5
# Number of sentences to generate
N_SENTENCES = 7

print('\n\n'.join(lm_merged.predict(TEXT,N_WORDS) for _ in range(N_SENTENCES)))
come margaret ancora spiegare ? xxbos

come ho voglio tutti tutti xxbos

come vai xxbos imparano aveva xxbos

come è questo po nella vita

come non ci noi resto guardie

come su hai organizzare andrebbe ad

come mi togli del costo ?