- The project was presented at the 2019 AILC Lectures on Computational Linguistics! The poster created for the presentation is available here.
The purpose of this research project is to analyze the epistolary corpus of Italo Svevo, one of the great Italian novelists of the twentieth century and a pioneer of the psychological novel in Italy. The analyses were performed on the corpus created by Cristina Fenu as the final project for the Master's in Digital Humanities at Università Ca' Foscari of Venice during the 2015-2016 academic year.
Results of the first analysis were published in the proceedings of the 2017 AIUCD Conference and are available on the website of the Svevian Museum of Trieste.
The analysis was structured in two parts:
- Topic modeling of the Italian corpus using latent Dirichlet allocation, to extract the main topics in the corpus and estimate their association with interlocutors over time.
- Sentiment analysis of the whole corpus using the Word-Emotion Association Lexicon (EmoLex) by Mohammad et al., to highlight relations between emotional states, topics and interlocutors over time.
This repository is structured as follows:
- The datasets folder contains the original letter corpus, and is the location where all subsequent datasets used for our purposes are saved. New: added a positive/negative Italian sentiment wordlist for the recurrent words connotation analysis.
- The results folder contains plots, in svg and png format, describing our findings and evaluating the performance of our LDA model.
- The topic_modeling notebook contains all the code used to perform the topic modeling analysis. In the end, it produces a svevo_with_topics.csv file containing the topics assigned to each letter. Only the 500 Italian letters with the most separated topics are taken into account.
- The sentiment_analysis_extraction notebook generates a sentiment.csv file containing the sentiment intensity percentage for all the letters in the original corpus.
- The sentiment_analysis_evaluation notebook creates many additional datasets used to evaluate and plot our results.
- New: The recurrent_words_connotation_analysis notebook is used to inspect which words account for most of the positive/negative sentiment over Svevo's lifespan.
- The future_perspectives notebook contains approaches that were tested for the analysis and ultimately set aside because of their complexity or their results, but that deserve a second look in future work.
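To make the idea of a per-letter sentiment intensity percentage more concrete, here is a minimal sketch of lexicon-based emotion scoring in the spirit of EmoLex. The miniature lexicon, the function name and the normalization (each emotion as a share of all emotion hits in the letter) are assumptions made for illustration; the notebooks may compute the percentages differently.

```python
from collections import Counter

# Hypothetical miniature lexicon: word -> associated emotions.
# The emotion labels follow the EmoLex categories, but these word
# associations are invented for the example.
TOY_LEXICON = {
    "amore": {"joy", "trust"},
    "paura": {"fear"},
    "morte": {"fear", "sadness"},
    "speranza": {"joy", "anticipation"},
}


def emotion_percentages(tokens, lexicon):
    """Return each emotion's share (in percent) of all emotion hits in tokens."""
    counts = Counter()
    for token in tokens:
        # Words missing from the lexicon contribute nothing.
        for emotion in lexicon.get(token.lower(), ()):
            counts[emotion] += 1
    total = sum(counts.values())
    if total == 0:
        # No lexicon word found: no emotion can be attributed to the letter.
        return {}
    return {emotion: 100.0 * n / total for emotion, n in counts.items()}


# Hypothetical letter text, split on whitespace for simplicity.
letter = "Amore e speranza vincono la paura".split()
scores = emotion_percentages(letter, TOY_LEXICON)
print(scores)
```

In this toy letter, joy is matched twice out of five emotion hits, so it receives 40% of the total, while trust, anticipation and fear receive 20% each.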
In order for all the notebooks to work properly, the following requirements should be met.
Warning: The last part of the future_perspectives notebook will not work out of the box. See the additional requirements below and the notebook itself for more information.
- numpy
- pandas
- gensim
- spacy
- sklearn
- pyLDAvis
- matplotlib
- seaborn
- tqdm
Simply run pip install -r requirements.txt inside this folder to automatically install all dependencies.
For the spacy package, the language models should be installed as follows:
python -m spacy download en
python -m spacy download fr
python -m spacy download it
python -m spacy download de
- syuzhet
- dplyr
- pander
Run install.packages(c("syuzhet", "dplyr", "pander")) inside an R shell.
The set of Italian embeddings necessary to test the Word2Vec approach in future_perspectives is available on the Italian NLP Lab website.
A short report of the research project has been updated and is available! The last section contains a textual description of our findings. For a visual understanding, please refer to the results folder.