# Labeling analysis

The original plan was to label tweets in order to train a supervised ML algorithm. We changed methodology along the way and never trained it. However, we still use what we discovered while labeling as a qualitative analysis of the dataset, which gives us a better understanding of the data we are working with.

## First labeling :

The first labeling was done on a dataset that was already processed (cleaning + NLP + cross-validation with the dictionaries). This task is difficult because deciding whether a tweet displays mental health issues is very subjective. During this labeling, we understood the need to "expand" our search to tweets showing signs of mental distress in general.

//TODO : add summer takeaway

### Key (qualitative) takeaways :

- Tweets were labeled "loosely", using multiple signs of mental distress, mostly sadness. Tweets with the following characteristics were therefore labeled as showing mental distress: nostalgia (either for the past or simply for the end of a nice day), mention of sad activities (watching sad movies or listening to sad songs, as this was most probably triggered by a prior feeling of sadness), or mention of being mad (either at other people such as haters or, more often, at their families).
- Some tweets mentioned the mental distress of other people (either how they helped them or raising awareness over these issues)
- Some tweets showed desperation (a form of mental distress) over the "way the world is"; these were not included.
- Some tweets only included _"motivational"_ quotes. We can suppose that these reflect bad times. Nonetheless, they were not included, as that assumption is a little far-fetched. Moreover, they could have confused our model (they contained neither words from our dictionary nor specific mentions of mental distress).
- A lot of mental distress tweets are related to unilateral feelings. They are often messages specifically targeted at someone who most certainly doesn't know the author exists (for example work colleagues or college classmates).
- Most tweets talking about depression or very low self-esteem show _« covert humour »_, a coping technique used to hide one's self-deprecation. These tweets are very difficult to detect as they use sarcasm (which current NLP techniques cannot reliably detect) and seldom use words from our dictionary.

- Maybe include « unfortunately » in the dict ? Maybe « therapy » ?
- Include « overthinking » in dict
- Summer often means hurt (people such as friends are away, maybe from uni or something similar)

### Key (quantitative) takeaways :

#### Possible dictionary modification  :

Many words were not previously included in our dictionary but signaled tweets that clearly showed mental distress. 3 possible additions are the words _unfortunately_, _therapy_ and _overthinking_. As we can see, these words do not appear often but, unlike other expressions, they are unique to tweets showing mental distress.

In [None]:
import pandas as pd

first_iter = pd.read_csv("data/labeled_tweets/english_labeled.csv")
first_iter.head()

In [None]:
first_iter[first_iter['main'].map(lambda x: 'unfortunately' in x)]

In [None]:
first_iter[first_iter['main'].map(lambda x: 'therapy' in x)]

In [None]:
first_iter[first_iter['main'].map(lambda x: 'overthinking' in x)]

On the other hand, our dictionary contains many words that maybe shouldn't be used. When looking at their appearances, we can see that they often produce false positives. These words are : _hate_, _hurt_, _pain_, _addict_ and _overdose_.

In [None]:
hate_count = first_iter[first_iter['split'].map(lambda x: 'hate' in x)]['split'].count()
print("Total appearance of 'hate': ", hate_count)
pos_hate_count = first_iter[first_iter['split'].map(lambda x: 'hate' in x) & (first_iter['mental'] == 1)]['split'].count()
print("Tweets showing mental distress : ", pos_hate_count)
print("Ratio : ", pos_hate_count/hate_count)

In [None]:
hurt_count = first_iter[first_iter['split'].map(lambda x: 'hurt' in x)]['split'].count()
print("Total appearance of 'hurt': ", hurt_count)
pos_hurt_count = first_iter[first_iter['split'].map(lambda x: 'hurt' in x) & (first_iter['mental'] == 1)]['split'].count()
print("Tweets showing mental distress : ", pos_hurt_count)
print("Ratio : ", pos_hurt_count/hurt_count)

In [None]:
pain_count = first_iter[first_iter['split'].map(lambda x: 'pain' in x)]['split'].count()
print("Total appearance of 'pain': ", pain_count)
pos_pain_count = first_iter[first_iter['split'].map(lambda x: 'pain' in x) & (first_iter['mental'] == 1)]['split'].count()
print("Tweets showing mental distress : ", pos_pain_count)
print("Ratio : ", pos_pain_count/pain_count)

In [None]:
addict_count = first_iter[first_iter['split'].map(lambda x: 'addict' in x)]['split'].count()
print("Total appearance of 'addict': ", addict_count)
pos_addict_count = first_iter[first_iter['split'].map(lambda x: 'addict' in x) & (first_iter['mental'] == 1)]['split'].count()
print("Tweets showing mental distress : ", pos_addict_count)
print("Ratio : ", pos_addict_count/addict_count)

In [None]:
overd_count = first_iter[first_iter['split'].map(lambda x: 'overd' in x)]['split'].count()
print("Total appearance of 'overdose': ", overd_count)
pos_overd_count = first_iter[first_iter['split'].map(lambda x: 'overd' in x) & (first_iter['mental'] == 1)]['split'].count()
print("Tweets showing mental distress : ", pos_overd_count)
print("Ratio : ", pos_overd_count/overd_count)

Looking at the numbers, we see that more than half of the tweets containing these words are false positives, which pushes us to delete these words (whereas the words we want to add are very specific). For reasons we will see below, the word _asylum_ should also be struck from our dictionary.
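The five cells above repeat the same count-and-ratio computation. As a minimal sketch, it could be factored into one helper. The name `keyword_stats`, its default column names and the toy DataFrame are assumptions made for illustration; the toy rows are invented and are not taken from the labeled dataset. The helper relies on `in`, so it behaves the same whether the text column holds raw strings or token lists.

```python
import pandas as pd


def keyword_stats(df, keyword, text_col="split", label_col="mental"):
    """Return (total, positives, ratio) for rows whose text contains `keyword`.

    Hypothetical helper: `text_col` and `label_col` default to the column
    names used in this notebook.
    """
    has_keyword = df[text_col].map(lambda text: keyword in text)
    # Build the label mask separately: `&` binds more tightly than `==`,
    # so `mask & df[label_col] == 1` would not group as intended.
    is_positive = df[label_col] == 1
    total = int(has_keyword.sum())
    positives = int((has_keyword & is_positive).sum())
    # Avoid a division by zero when the keyword never appears.
    ratio = positives / total if total else float("nan")
    return total, positives, ratio


# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "split": ["i hate mondays", "i hate myself so much", "nice day", "they hate on me"],
    "mental": [0, 1, 0, 0],
})

total, positives, ratio = keyword_stats(toy, "hate")
print("Total appearance of 'hate': ", total)
print("Tweets showing mental distress : ", positives)
print("Ratio : ", ratio)
```

With such a helper, each keyword check becomes a single call, for example `keyword_stats(first_iter, 'pain')`.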

#### Negative dictionary creation :

On top of having a dictionary of words which need to be included, some words need to be explicitly taken out as they represent topics that have nothing to do with mental distress. Most of these topics are specific to the year 2016 as they mention memorable events that happened.

8 words in particular strike us : _overrated_, _india_, _gov_, _government_, _overwatch_, _vine_, _trump_, _hillary_. Moreover, another topic that should be struck out is Syria, a topic that itself covers many words: _syria_, _assad_, _refugee_ and so on. The word _asylum_ is solely used in this context and never in the context of mental health.

_Note:_ as there are many words, we will only highlight 2 or 3 words to show how often they appear.

In [None]:
print("Total appearance of 'overrated': ", first_iter[first_iter['split'].map(lambda x: 'overr' in x)]['split'].count())

In [None]:
print("Total appearance of 'gov': ", first_iter[first_iter['split'].map(lambda x: 'gov' in x)]['split'].count())

In [None]:
print("Total appearance of 'syria' or 'assad': ", first_iter[first_iter['split'].map(lambda x: 'syria' in x or 'assad' in x)]['split'].count())
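To make the idea of a negative dictionary concrete, here is a minimal sketch of how such a filter could be applied to a tweet. The function name `is_blacklisted` and the simple letters-only tokenizer are assumptions made for illustration, not the pipeline actually used; the word list is the one given above.

```python
import re

# Words listed above as unrelated to mental distress.
BLACKLIST = {
    "overrated", "india", "gov", "government", "overwatch", "vine",
    "trump", "hillary", "syria", "assad", "refugee", "asylum",
}


def is_blacklisted(tweet, blacklist=BLACKLIST):
    """True if the tweet contains at least one blacklisted token.

    Assumption: tokens are lowercase runs of letters, so punctuation and
    hashtags markers are ignored.
    """
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return not tokens.isdisjoint(blacklist)


print(is_blacklisted("Trump and Hillary debate tonight!"))  # True
print(is_blacklisted("i feel so lonely tonight"))           # False
```

Tweets for which `is_blacklisted` returns `True` would be dropped before the positive dictionary is applied.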

Another issue we encountered (which was due to our algorithm) was words that contain one of our dictionary words as a substring. The 2 main words which appeared in this context were _suicidesquad_ (containing _suicide_) and _spain_ (containing _pain_). Thus, we also decided to include them in the _"negative"_ (blacklist) dictionary.
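An alternative to blacklisting every such word would be to match dictionary words as whole tokens only. The sketch below uses regular-expression word boundaries; the helper name `contains_word` is hypothetical and this is not the matching used in the cells of this notebook.

```python
import re


def contains_word(text, word):
    """True if `word` occurs in `text` as a whole token, not inside a longer word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_word("holiday in spain", "pain"))           # False
print(contains_word("so much pain today", "pain"))         # True
print(contains_word("watching #suicidesquad", "suicide"))  # False
```

The trade-off is that whole-word matching no longer catches inflected forms from a stem (for example, `contains_word("so addicted", "addict")` is `False`), so stems would need to be handled separately.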

## Second labeling :

This second analysis has been done on a different dataset. Instead of labeling only tweets previously labeled as "NEUTRAL" and "NEGATIVE", we apply our dictionary to the whole dataset to have more relevant results (and show that Spinn3r's labeling algorithm does not work as expected, even though only 1.4% of the tweets in our sample dataset are mislabeled).

In [None]:
import pandas as pd

second_iter = pd.read_csv("data/labeled_tweets/new_english_labeled.csv")
second_iter.head()

In [None]:
difference = second_iter[second_iter['sentiment'].map(lambda x: 'POSITIVE' in x) & (second_iter['mental'] == 1)]['main'].count()
print("Difference: ", difference)

### Key (qualitative) takeaways :

Unlike the first pass on the data, this labeling does not include a thorough qualitative analysis. It only allowed us to improve our methodology and define open questions (one of which is the difference in emotional expression between the genders).

// TODO: add stuff about fangirls (explains why more women talk )

### Key (quantitative) takeaways :

Once again, we define 2 categories of keywords : those that need to be added to our dictionary and those that should be added to a "blacklist".

#### Dictionary additions :

A very long list of words has been defined : _panic attack_, _sleepless_, _problems falling asleep_, _lonely and sad_, _confused and sad_, _gambling_, _breakdown_, and _mental issue_. As this is a very long list, we will only quantify some of them to show how important they are.

In [None]:
total_bd = second_iter[second_iter['main'].map(lambda x: 'breakdown' in x or 'break down' in x)]['main'].count()
print("Total appearance of 'breakdown': ", total_bd)

In [None]:
total_sad = second_iter[second_iter['main'].map(lambda x: ('confused' in x or 'lonely' in x) and 'sad' in x)]['main'].count()
print("Total appearance of the expressions involving 'sad' : ", total_sad)

#### Blacklist : 

Once again, we found that the word _asylum_ and the words with the stem _"addict"_ should not be included in our dictionary. Moreover, we found that tweets with the following words should not be included in our final set : _sadly_, _not afraid_, _stress relief_ and _Sonic Mania_. As before, we will illustrate our findings using 2 of them.

In [None]:
sadly_count = second_iter[second_iter['main'].map(lambda x: 'sadly' in x)]['main'].count()
print("Total appearance of 'sadly': ", sadly_count)
pos_sadly_count = second_iter[second_iter['main'].map(lambda x: 'sadly' in x) & (second_iter['mental'] == 1)]['main'].count()
print("Tweets showing mental distress : ", pos_sadly_count)
print("Ratio : ", pos_sadly_count/sadly_count)

In [None]:
sm_count = second_iter[second_iter['main'].map(lambda x: 'sonic' in x and 'mania' in x)]['main'].count()
print("Total appearance of 'Sonic Mania': ", sm_count)
pos_sm_count = second_iter[second_iter['main'].map(lambda x: 'sonic' in x and 'mania' in x) & (second_iter['mental'] == 1)]['main'].count()
print("Tweets showing mental distress : ", pos_sm_count)
print("Ratio : ", pos_sm_count/sm_count)

Even though this second expression is not used very often, it can be grouped with the words previously seen as being "contextual" (to the year 2016). Another explanation for these matches is that _'Sonic'_ appears together with _'mania'_, a word that we need to keep in our dictionary.