# Sentiment Extractor

> Returns sentiments from audio files

This notebook uses a transformer model to extract sentiments from earnings calls. The model is a pre-trained model available on Hugging Face: [hubert-large-superb-er](https://huggingface.co/superb/hubert-large-superb-er). The earnings call data used for this project was collected from [earningscall.biz](https://earningscall.biz/).

A GPU runtime should be used for this notebook, since sentiment extraction with the transformer is much slower on CPU.

## Package Installation

The following packages are necessary for execution of this notebook:

In [None]:
!pip install transformers
!pip install mutagen

## Data Loading

The following code cells mount Google Drive, set the necessary file path, and read in the earnings call metadata prior to sentiment extraction.

***This should be altered based on user settings***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
cur_path = "/content/drive/My Drive/earnings-call-audio-modeling/"
os.chdir(cur_path)
!pwd

/content/drive/My Drive/earnings-call-audio-modeling


In [None]:
import pandas as pd
data = pd.read_csv('data.csv')

In [None]:
os.chdir('audio-files')
!pwd

/content/drive/My Drive/earnings-call-audio-modeling/audio-files


## Define function to extract sentiments from audio data

The function takes an audio file, a chunk duration (in seconds), and a sample rate as inputs, and returns the sentiment scores for that audio file.

The function first reads the length of the audio in seconds and then computes the start point of each chunk at intervals of the given duration. The default duration is 5 seconds, because the utterances in the IEMOCAP dataset used to train the model for emotion recognition had an average length between 4 and 5 seconds ([source](https://sail.usc.edu/iemocap/Busso_2008_iemocap.pdf)).
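
As a minimal sketch of the start-point calculation described above, the snippet below uses a hypothetical helper, `chunk_start_points`, which is not part of the notebook. It assumes the file length is already known in seconds and shows that a trailing partial chunk is dropped.

```python
import math

def chunk_start_points(file_length, duration=5, sample_rate=16000):
    """Return the starting sample index of each full chunk.

    Hypothetical helper for illustration; it mirrors the list
    comprehension used in get_sentiments.
    """
    n_chunks = math.floor(file_length / duration)
    return [i * sample_rate * duration for i in range(n_chunks)]

# A 12.3-second file yields two full 5-second chunks;
# the final 2.3 seconds are not covered by any chunk.
print(chunk_start_points(12.3))  # [0, 80000]
```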

The load function from the librosa library converts the audio file into a numeric array so that the data can be passed into the transformer pipeline. The default sample rate for librosa.load is set to 16000 (16 kHz), as this is the sample rate of the audio the model was trained on.

The transformer pipeline is then defined for our audio classification task of sentiment analysis and placed on the first available GPU (device is set to 0). An empty list is then initialized to store the sentiments.

Looping over the start points, sentiment analysis is conducted on each audio chunk. Slices of the array produced by librosa (defined as y in the function) are passed through the transformer pipeline, which outputs a list of dictionaries holding the probability of each sentiment (happy, angry, sad, or neutral). This output is appended to the sentiments list, and the process repeats until all chunks of the audio array have been classified.

Finally, the remaining code converts our sentiment list of dictionaries into a dataframe with columns for each sentiment score, in order of the earliest to latest audio chunks. 
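
The reshaping step can also be sketched without relying on a fixed number of labels per chunk. The example below is an illustration only: `to_rows` is a hypothetical helper, and the scores are made-up placeholder values in the shape described above (one list of `{'score', 'label'}` dictionaries per chunk).

```python
def to_rows(sentiments):
    """Turn per-chunk lists of {'label', 'score'} dicts into one dict per chunk."""
    return [{d["label"]: d["score"] for d in chunk} for chunk in sentiments]

# Placeholder scores for two chunks (not real model output)
example = [
    [{"score": 0.6, "label": "neu"}, {"score": 0.2, "label": "hap"},
     {"score": 0.1, "label": "ang"}, {"score": 0.1, "label": "sad"}],
    [{"score": 0.5, "label": "hap"}, {"score": 0.3, "label": "neu"},
     {"score": 0.1, "label": "sad"}, {"score": 0.1, "label": "ang"}],
]

rows = to_rows(example)
print(rows[0]["neu"])  # 0.6
print(rows[1]["hap"])  # 0.5
```

Passing `rows` to `pd.DataFrame` would then give one row per chunk and one column per label, in chunk order.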

***The current function does not follow best coding practices and takes hours to run on our entire 2021 Q4 dataset. To speed it up, the call to librosa.load should be moved out of the function. Instead, librosa.load should be run in parallel over the files of interest to build a dataset of arrays. These arrays could then be passed directly into the pipeline, significantly reducing the runtime for future use. This is documented in issue #(fill this in).***
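
A rough sketch of the parallel-loading idea, under stated assumptions: `load_all` is a hypothetical helper, the loader is passed in as an argument so the sketch does not depend on librosa, and a thread pool is used for simplicity. Whether threads or separate processes give the better speed-up for audio decoding would need to be measured.

```python
from concurrent.futures import ThreadPoolExecutor

def load_all(files, loader, max_workers=4):
    """Apply `loader` to every file concurrently and return {file: result}.

    Hypothetical helper: `loader` stands in for a function that decodes
    one audio file into an array.
    """
    files = list(files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `files`
        results = list(pool.map(loader, files))
    return dict(zip(files, results))

# Stand-in loader for demonstration; in the notebook this could be
# something like: lambda f: librosa.load(f, sr=16000)[0]
demo = load_all(["a.mp3", "bb.mp3"], loader=len, max_workers=2)
print(demo)  # {'a.mp3': 5, 'bb.mp3': 6}
```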

In [None]:
# libraries to deal with audio file
from mutagen.mp3 import MP3
import librosa

# libraries for modeling
from transformers import pipeline

# other general libraries
import math
from itertools import chain

In [None]:
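def get_sentiments(audio_file, duration=5, sample_rate=16000):
  # Length of the audio in seconds, read from the MP3 metadata
  file_length = MP3(audio_file).info.length
  # Start index (in samples) of each full chunk; a trailing partial chunk is dropped
  start_points = [i*sample_rate*duration for i in range(math.floor(file_length/duration))]

  # Decode the audio into a numeric array resampled to sample_rate
  y, _ = librosa.load(audio_file, sr=sample_rate)
  # device=0 places the pipeline on the first GPU
  classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er", device=0)
  sentiments = []
  for i in start_points:
    # Classify one chunk; the result is a list of {'score', 'label'} dicts
    labels = classifier(y[i:i+(sample_rate*duration)])
    sentiments.append(labels)
  # Flatten to one row per (chunk, label) pair
  sentiment_df = pd.DataFrame(list(chain.from_iterable(sentiments)))
  # Assumes 4 labels per chunk, so every 4 consecutive rows share a chunk number
  sentiment_df['utterance'] = sentiment_df.index // 4
  # Pivot to one row per chunk with one column per label
  output_df = sentiment_df.pivot(index='utterance', columns='label', values='score').reset_index(drop=True).rename_axis(None, axis=1)

  return output_df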

In [None]:
data

| | date | ticker | quarter | year | file |
|---|---|---|---|---|---|
| 0 | 19-Jan | AA | Q4 | 2021 | AA Q4 2021.mp3 |
| 1 | 20-Jan | AAL | Q4 | 2021 | AAL Q4 2021.mp3 |
| 2 | 26-Jan | ABT | Q4 | 2021 | ABT Q4 2021.mp3 |
| 3 | 28-Jan | ABTX | Q4 | 2021 | ABTX Q4 2021.mp3 |
| 4 | 25-Jan | ADM | Q4 | 2021 | ADM Q4 2021.mp3 |
| ... | ... | ... | ... | ... | ... |
| 188 | 27-Jan | WHR | Q4 | 2021 | WHR Q4 2021.mp3 |
| 189 | 27-Jan | WRB | Q4 | 2021 | WRB Q4 2021.mp3 |
| 190 | 26-Jan | WSBC | Q4 | 2021 | WSBC Q4 2021.mp3 |
| 191 | 27-Jan | XEL | Q4 | 2021 | XEL Q4 2021.mp3 |


## Extract Sentiment Scores

The function is applied to each audio file in our dataset. The resulting dataframe contains five columns: the company ticker and four emotion probability columns.

In [None]:
# Collect one dataframe per file, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
frames = []
for index, row in data.iterrows():
  curr_df = get_sentiments(row['file'])
  curr_df.insert(0, 'company', row['ticker'])
  frames.append(curr_df)
all_df = pd.concat(frames, ignore_index=True)
all_df

Below, we insert a column for timestamps, representing the starting time (in seconds) of each chunk (represented by a row in the dataset).

***This should be done as a part of our get_sentiments function and has been added as an issue (see issue #add an issue).***
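
One way the timestamps could be produced per file is sketched below. `chunk_timestamps` is a hypothetical helper, and it assumes the chunks are back to back and all have the same duration.

```python
def chunk_timestamps(n_chunks, duration=5):
    """Start time in seconds of each chunk, assuming equal-length, back-to-back chunks."""
    return [i * duration for i in range(n_chunks)]

print(chunk_timestamps(4))  # [0, 5, 10, 15]
```

Inside get_sentiments, the result could be added before returning, for example with `output_df.insert(0, 'timestamp', chunk_timestamps(len(output_df), duration))`.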

In [None]:
# Chunk number within each company times the chunk duration (5 seconds)
all_df.insert(1, 'timestamp', all_df.groupby(['company']).cumcount()*5)

In [None]:
all_df

| | company | timestamp | ang | hap | neu | sad |
|---|---|---|---|---|---|---|
| 0 | AA | 0 | 0.052980 | 0.553532 | 0.392309 | 0.001179 |
| 1 | AA | 5 | 0.001278 | 0.314281 | 0.679903 | 0.004538 |
| 2 | AA | 10 | 0.026318 | 0.109212 | 0.863732 | 0.000738 |
| 3 | AA | 15 | 0.038293 | 0.563405 | 0.389889 | 0.008412 |
| 4 | AA | 20 | 0.054359 | 0.337466 | 0.605569 | 0.002606 |
| ... | ... | ... | ... | ... | ... | ... |
| 123648 | XSPA | 3415 | 0.028637 | 0.403518 | 0.470089 | 0.097756 |
| 123649 | XSPA | 3420 | 0.097064 | 0.164416 | 0.657496 | 0.081025 |
| 123650 | XSPA | 3425 | 0.039409 | 0.079910 | 0.869063 | 0.011618 |
| 123651 | XSPA | 3430 | 0.071282 | 0.430433 | 0.456526 | 0.041759 |


## Export final sentiments file

The exported file will support future analysis.

In [None]:
os.chdir('..')

In [None]:
all_df.to_csv('audio_sentiment.csv', index = False)