# Explicit Content in Laureates of 2020 Wattys Award

Wattpad is a website where anyone can share their own stories. Although many now-popular writers started on Wattpad and have written generally good fiction, the site still has a reputation as a cringeworthy place full of sexually oriented fan fiction. In the Czech Republic, we even have a Facebook page called [Úlovky z Wattpadu](https://www.facebook.com/ulovkyzwattpadu) that shares snippets of the worst stories written in Czech (it is really funny, as long as you don't have to read the whole story).

I decided to check the 30 stories that won the 2020 Wattys Award to see whether it is true that Wattpad stories generally lack quality and are full of sexual or explicit content.


## Getting the data

I used my own [fork](https://github.com/AiKuroyake/Wattpad2Epub) of the [Wattpad2Epub](https://github.com/GatoLoko/Wattpad2Epub) script. Wattpad2Epub lets you download Wattpad stories easily and save them in EPUB format. I made some minor changes to the code, which is why I used my own version.

After downloading the 30 stories from [The Wattys 2020 Award Winners](https://www.wattpad.com/list/996419659-the-2020-wattys-award-winners), I needed to convert them from EPUB to plain text so that I could process them with Python. I used [Calibre](https://calibre-ebook.com/) for this task.
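Calibre also ships a command-line tool, `ebook-convert`, so the conversion could be scripted instead of done by hand. The following is only a minimal sketch, assuming Calibre is installed and `ebook-convert` is on the PATH. The helper names `build_convert_command` and `convert_epub` are hypothetical, and this is not necessarily how the files for this post were converted.

```python
import subprocess
from pathlib import Path


def build_convert_command(epub_path):
    """Build the ebook-convert call that writes a TXT file next to the EPUB."""
    epub_path = Path(epub_path)
    txt_path = epub_path.with_suffix('.txt')
    # ebook-convert takes the input file first and the output file second;
    # the output format is inferred from the .txt extension
    return ['ebook-convert', str(epub_path), str(txt_path)]


def convert_epub(epub_path):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(build_convert_command(epub_path), check=True)


# Only build and show the command here; call convert_epub() to actually run it
print(build_convert_command('for-june.epub'))
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to inspect the command before converting all 30 files.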

## Cleaning the data

Every book starts with an introduction containing a synopsis, a chapter list, and information about the book. I wanted the data to be as clean as possible, so I decided to delete these. Unfortunately, the introduction differs slightly from book to book, so I had to go over them manually.

I wanted to remove the authors' notes from the books, too. The problem was the same as with the introductions: I had to delete the notes manually. Unfortunately, after scrolling through a few stories, I found that some authors' notes also sit between chapters, though not in all stories and not after every chapter. I left these notes in place.

Almost every story contains images. In the text, each image is replaced by the following line: _Oops! This image does not follow our content guidelines. To continue publishing, please remove it or upload a different image._ I removed it with the `replace()` method when reading the file.

I did not remove _Chapter 1_ or similar headings. I also left numbers, stop words, and other tokens in the texts. These tokens will not distort my results in any way.

## Preparing the data

I want to work with a dictionary where the keys are the titles of the stories and the values are the stories themselves.

I need to split each story into paragraphs and convert each paragraph into an nlp object (a spaCy `Doc`). The problem is that converting paragraphs into nlp objects takes a lot of time, and it is inconvenient to wait half an hour for the data every time I restart the Jupyter Notebook.

That's why I use the `save_docs()` function to store the converted nlp objects for all books and the `load_docs()` function to load them quickly and easily.
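The caching idea can be sketched generically: convert only when no saved file exists yet, otherwise load the saved one. This is a minimal sketch, not the code used in this notebook. The helper name `load_or_build` and its parameters are hypothetical, and it works with any pair of save and load functions.

```python
from pathlib import Path


def load_or_build(cache_path, build, save, load):
    """Return cached data if the cache file exists, otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # fast path: reuse the result stored by an earlier run
        return load(cache_path)
    # slow path: do the expensive work once and store the result
    result = build()
    save(result, cache_path)
    return result
```

With the functions from this notebook, a call could look like `load_or_build(path, lambda: convert_to_nlp(read_books(txt)), save_docs, lambda p: load_docs(p, nlp))`, where `path` and `txt` are placeholder names for the .spacy file and the source text file.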

In [1]:
import spacy
from spacy.tokens import DocBin
nlp = spacy.load('en_core_web_sm')

In [130]:
from collections import defaultdict

In [5]:
def save_docs(docs, filename):
    '''Save a list of spaCy docs to a file'''
    # list of attributes I need; can be found on https://spacy.io/api/matcher
    doc_bin = DocBin(attrs=["SENT_START", "LEMMA", "ENT_TYPE", "POS", "DEP"])
    for doc in docs:
        doc_bin.add(doc)
    # 'wb' overwrites the file, so re-running the cell does not append a second copy
    with open(filename, 'wb') as fp:
        fp.write(doc_bin.to_bytes())

In [2]:
def load_docs(filename, nlp):
    '''Load a list of spaCy docs from a file'''
    with open(filename, 'rb') as fp:
        doc_bin = DocBin().from_bytes(fp.read())
    return list(doc_bin.get_docs(nlp.vocab))

In [3]:
def read_books(story):
    '''Read a text file and split it into non-empty paragraphs (one per line).'''
    placeholder = ('Oops! This image does not follow our content guidelines. '
                   'To continue publishing, please remove it or upload a different image.')
    with open(story) as s:
        text = s.read().replace(placeholder, '')
    # splitting on '\n' already removes the newlines; blank lines are skipped
    return [p for p in text.split('\n') if p.strip()]

In [4]:
def convert_to_nlp(text):
    '''Converts paragraphs into nlp objects'''
    docs = [nlp(par) for par in text]
    return docs

I hardcoded the stories' file names into a variable. It would be more convenient to iterate over the whole data folder and pick up one book at a time, but I got errors when I tried that.

In [54]:
stories = ['data/a-bear-in-sheeps-clothing.txt', 
           'data/a-fantasy-real.txt', 
           'data/a-is-for-arson.txt', 
           'data/a-timely-knight.txt', 
           'data/a-twist-of-marvel-infinity-war.txt', 
           'data/bending-the-rules.txt',
           'data/breaking-darkness.txt', 
           'data/charlie-and-dia.txt', 
           'data/comfort-the-wolves.txt',
           'data/dark-side-of-the-morning.txt', 
           'data/for-june.txt', 
           'data/how-to-be-the-best-third-wheel.txt',
           'data/how-to-lose-weight.txt', 
           'data/human-code.txt', 
           'data/inspector-rames.txt',
           'data/jackson-humes-is-not-a-superhero.txt',
           'data/oliver-ausman-lives-again.txt',
           'data/parasomnia.txt',
           'data/running-from-the-past.txt', 
           'data/t-m-i.txt', 
           'data/the-devils-match.txt', 
           'data/the-mosaic-in-her-eyes.txt', 
           'data/the-night-the-vampires-came.txt', 
           'data/the-omen-girl.txt',
           'data/the-painted-altair.txt',
           'data/the-psychopath-next-door.txt',
           'data/valeria-torres.txt',
           'data/we-the-young.txt',
           'data/winners-dont-have-bad-days.txt', 
           'data/zombie-soap.txt']
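As an alternative to the hardcoded list, the folder could be scanned with `pathlib`. This is only a sketch under an assumption: that the errors came from entries that are not story files, such as hidden files or subfolders (for example Jupyter's checkpoint folder). I have not verified that this was the actual cause. The helper names `list_story_files` and `spacy_path` are hypothetical.

```python
from pathlib import Path


def list_story_files(folder='data', suffix='.txt'):
    """Return sorted paths of regular, non-hidden files with the given suffix."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix == suffix and not p.name.startswith('.')
    )


def spacy_path(txt_path, out_folder='data_spacy'):
    """Map e.g. data/for-june.txt to data_spacy/for-june.spacy."""
    # Path.stem drops both the folder and the extension
    return Path(out_folder) / (Path(txt_path).stem + '.spacy')
```

`Path.stem` would also replace the `story[5:-4]` slicing used when saving, which silently depends on the folder name being exactly five characters long including the slash.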

Split each story into paragraphs. Convert each paragraph into an nlp object. Save the nlp object of each story in a separate file. 

In [55]:
for story in stories:
    print('Reading {}...'.format(story))
    text = read_books(story)
    print('Converting {}...'.format(story))
    nlp_text = convert_to_nlp(text)
    print('Saving {} to {}.spacy...'.format(story, story[5:-4]))
    save_docs(nlp_text, '{}.spacy'.format(story[5:-4]))
    print()

Reading data/a-bear-in-sheeps-clothing.txt...
Converting data/a-bear-in-sheeps-clothing.txt...
Saving data/a-bear-in-sheeps-clothing.txt to a-bear-in-sheeps-clothing.spacy...

Reading data/a-fantasy-real.txt...
Converting data/a-fantasy-real.txt...
Saving data/a-fantasy-real.txt to a-fantasy-real.spacy...

Reading data/a-is-for-arson.txt...
Converting data/a-is-for-arson.txt...
Saving data/a-is-for-arson.txt to a-is-for-arson.spacy...

Reading data/a-timely-knight.txt...
Converting data/a-timely-knight.txt...
Saving data/a-timely-knight.txt to a-timely-knight.spacy...

Reading data/a-twist-of-marvel-infinity-war.txt...
Converting data/a-twist-of-marvel-infinity-war.txt...
Saving data/a-twist-of-marvel-infinity-war.txt to a-twist-of-marvel-infinity-war.spacy...

Reading data/bending-the-rules.txt...
Converting data/bending-the-rules.txt...
Saving data/bending-the-rules.txt to bending-the-rules.spacy...

Reading data/breaking-darkness.txt...
Converting data/breaking-darkness.txt...
Savin

In the `data_spacy` folder, I collected all the stories, split into paragraphs and converted to nlp objects.

I created another variable called `nlp_stories`, which lists all the stories saved as .spacy files. I hardcoded these names, too, because iterating over a folder gave me errors.

In [6]:
nlp_stories = ['data_spacy/a-bear-in-sheeps-clothing.spacy', 
               'data_spacy/a-fantasy-real.spacy', 
               'data_spacy/a-is-for-arson.spacy', 
               'data_spacy/a-timely-knight.spacy', 
               'data_spacy/a-twist-of-marvel-infinity-war.spacy', 
               'data_spacy/bending-the-rules.spacy',
               'data_spacy/breaking-darkness.spacy', 
               'data_spacy/charlie-and-dia.spacy', 
               'data_spacy/comfort-the-wolves.spacy',
               'data_spacy/dark-side-of-the-morning.spacy', 
               'data_spacy/for-june.spacy', 
               'data_spacy/how-to-be-the-best-third-wheel.spacy',
               'data_spacy/how-to-lose-weight.spacy', 
               'data_spacy/human-code.spacy', 
               'data_spacy/inspector-rames.spacy',
               'data_spacy/jackson-humes-is-not-a-superhero.spacy',
               'data_spacy/oliver-ausman-lives-again.spacy',
               'data_spacy/parasomnia.spacy',
               'data_spacy/running-from-the-past.spacy', 
               'data_spacy/t-m-i.spacy', 
               'data_spacy/the-devils-match.spacy', 
               'data_spacy/the-mosaic-in-her-eyes.spacy', 
               'data_spacy/the-night-the-vampires-came.spacy', 
               'data_spacy/the-omen-girl.spacy',
               'data_spacy/the-painted-altair.spacy',
               'data_spacy/the-psychopath-next-door.spacy',
               'data_spacy/valeria-torres.spacy',
               'data_spacy/we-the-young.spacy',
               'data_spacy/winners-dont-have-bad-days.spacy', 
               'data_spacy/zombie-soap.spacy']

I iterate over `nlp_stories` to create a dictionary consisting of a title (key) and the story itself (value). I use the `load_docs()` function to quickly access the preprocessed data. 

In [29]:
def stories_dict(nlp_stories):
    '''Create a dictionary of stories'''
    docs = {}
    for story in nlp_stories:         # for each story
        doc = load_docs(story, nlp)   # load the preprocessed .spacy file
        title = str(doc[0][2:])       # get the story's title (and omit 'Title' and ':' tokens)
        docs[title] = doc[1:]         # save the story into a dictionary, omit the title from the story
    return docs

In [30]:
docs = stories_dict(nlp_stories)

## Finding Explicit Content

At first, I wanted to explore common topics in the stories. However, the dataset is very small, so I did not get any meaningful results. Instead, we can start by simply searching for the word 'sex' and then decide on further steps based on the results.

In [81]:
def containing_sex(docs):
    stories = []
    for key, val in docs.items():
        for par in val:
            for tok in par:
                if tok.lemma_ == 'sex':
                    if key not in stories:
                        stories.append(key)
    return stories

cs = containing_sex(docs)
print(len(cs))

23


Out of 30 stories, 23 contain the word 'sex'.

The following stories have at least one occurrence of the word 'sex':

In [100]:
[s for s in cs]

["A Bear in Sheep's Clothing | Book #1",
 'A Fantasy Real',
 'A Timely Knight',
 'Bending the Rules',
 'Breaking Darkness',
 'Comfort the Wolves',
 'Dark Side of the Morning',
 'for June',
 'How To Be The Best Third Wheel ✔',
 'How To Lose Weight And Survive The Apocalypse',
 'Inspector Rames',
 'Oliver Ausman Lives Again',
 'Parasomnia',
 'RUNNING FROM THE PAST',
 'T.M.I.',
 "The Devil's Match",
 'The Mosaic in Her Eyes',
 'The Night the Vampires Came',
 'The Painted Altar',
 'The Psychopath Next Door',
 'Valeria Torres and the Midas Vault',
 'We The Young',
 'Zombie Soap']

Whereas these stories do not contain the word 'sex':

In [142]:
no_sex = [key for key in docs.keys() if key not in cs]

In [143]:
no_sex

['A is For Arson: A Langley & Porter Mystery',
 'A Twist Of Marvel Infinity War',
 'Charlie and Dia',
 'Human Code',
 'Jackson Humes is Not a Superhero',
 'THE OMEN GIRL',
 "Winners Don't Have Bad Days"]

However, we cannot assume that a story contains sexual scenes just because the word 'sex' occurs in it. Conversely, stories in which the word 'sex' does not occur can still have inappropriate scenes.

We can extend our vocabulary beyond the single word 'sex'. I picked 33 words from a "sex vocabulary" list that I found on Wattpad. We are going to search for the lemmas of these words, count their occurrences, and sort the stories by that count. We are interested only in the total number of occurrences, not in how many different words a story contains.

What are the results now?

In [157]:
def read_words(file):
    '''Read a list of inappropriate words, one per line'''
    with open(file) as f:
        words = [w.rstrip() for w in f]  # delete the newline character
    return words

In [172]:
s_words = read_words('s_words.txt')

In [173]:
def containing_more(docs, s_words):
    more = defaultdict(lambda: 0)
    for key, val in docs.items():
        for par in val:
            for tok in par:
                if tok.lemma_ in s_words:
                    more[key] += 1
    return more
cm = containing_more(docs, s_words)
print(len(cm))

30


All of the 30 stories contain something from our vocabulary. 

Let's rank the stories by the number of occurrences and see where the 7 stories not containing the word 'sex' appear.

In [174]:
from operator import itemgetter

cm_sorted = sorted(cm.items(), reverse=True, key=itemgetter(1))

In [175]:
for i, story in enumerate(cm_sorted):
    print(i, story)

0 ('A Fantasy Real', 346)
1 ("A Bear in Sheep's Clothing | Book #1", 229)
2 ('T.M.I.', 227)
3 ('Bending the Rules', 205)
4 ('The Mosaic in Her Eyes', 188)
5 ('How To Be The Best Third Wheel ✔', 155)
6 ('RUNNING FROM THE PAST', 148)
7 ('for June', 133)
8 ('How To Lose Weight And Survive The Apocalypse', 123)
9 ('The Psychopath Next Door', 119)
10 ("The Devil's Match", 114)
11 ('Comfort the Wolves', 109)
12 ('Inspector Rames', 96)
13 ('We The Young', 91)
14 ('A Timely Knight', 89)
15 ('Zombie Soap', 86)
16 ('Dark Side of the Morning', 76)
17 ('Breaking Darkness', 65)
18 ("Winners Don't Have Bad Days", 60)
19 ('Human Code', 41)
20 ('A Twist Of Marvel Infinity War', 39)
21 ('Charlie and Dia', 38)
22 ('Parasomnia', 37)
23 ('The Night the Vampires Came', 30)
24 ('Valeria Torres and the Midas Vault', 29)
25 ('Oliver Ausman Lives Again', 26)
26 ('THE OMEN GIRL', 21)
27 ('Jackson Humes is Not a Superhero', 19)
28 ('The Painted Altar', 12)
29 ('A is For Arson: A Langley & Porter Mystery', 11)
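The printout above is numbered from 0, so a story's place in the ranking is its index plus one. A minimal sketch of 1-based numbering with `enumerate(..., start=1)`, using only the first three entries from the output above as sample data:

```python
# First three (title, count) pairs copied from the printout above
top_three = [
    ('A Fantasy Real', 346),
    ("A Bear in Sheep's Clothing | Book #1", 229),
    ('T.M.I.', 227),
]

# start=1 makes the counter match the place in the ranking
ranked = [(place, title, count)
          for place, (title, count) in enumerate(top_three, start=1)]

for place, title, count in ranked:
    print(place, title, count)
```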


It is interesting that the 7 stories not containing the word 'sex' do not all sit at the very end of the list. To be exact (the printout above counts from 0, so index 18 is 19th place), **Winners Don't Have Bad Days** placed 19th with 60 occurrences of words from our list, followed by **Human Code** (20th) with 41 occurrences. **A Twist of Marvel Infinity War** follows with 39 occurrences and **Charlie and Dia** with 38. Near the end, we have **The Omen Girl** with 21 occurrences, **Jackson Humes is Not a Superhero** with 19, and **A is for Arson: A Langley & Porter Mystery** with 11. None of these stories appears among the 10 most explicit ones, but **Winners Don't Have Bad Days** and **Human Code** still ended up fairly high in the list.

The problem may lie in the words I chose to search for. I included words that do not necessarily describe anything sexual. For example, 'fuck' can be used as a plain swear word, and 'suck' does not have to refer to a sexual act.
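One way to see which words drive a story's score is to count each lemma separately instead of keeping a single total. This is a minimal sketch on toy data: the function name `lemma_breakdown`, the toy stories, and the three-word vocabulary are made up for illustration and are not my real data.

```python
from collections import Counter, defaultdict


def lemma_breakdown(stories, vocabulary):
    """Count, per story, how often each vocabulary lemma occurs."""
    vocabulary = set(vocabulary)  # set membership tests are fast
    breakdown = defaultdict(Counter)
    for title, lemmas in stories.items():
        for lemma in lemmas:
            if lemma in vocabulary:
                breakdown[title][lemma] += 1
    return breakdown


# Toy input: each story is a flat list of lemmas
toy_stories = {
    'Story A': ['she', 'suck', 'in', 'a', 'breath', 'and', 'kiss', 'he'],
    'Story B': ['what', 'the', 'fuck', 'fuck', 'off'],
}
toy_vocab = ['fuck', 'kiss', 'suck']

result = lemma_breakdown(toy_stories, toy_vocab)
for title, counts in result.items():
    print(title, counts.most_common())
```

With the real data, the flat lemma list for one story could be built as `[tok.lemma_ for par in val for tok in par]`. A story whose score comes almost entirely from 'fuck' or 'suck' is then easy to tell apart from one with a broader range of words.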

Let's iterate over the 7 stories not containing 'sex' once more and see which lemmas occur in them and what the sentences containing these lemmas look like.

In [176]:
def no_sex_words(docs, no_sex, s_words):
    for key, val in docs.items():
        if key in no_sex:
            for par in val:
                for tok in par:
                    if tok.lemma_ in s_words:
                        print(key, '\t', tok.lemma_, '\t', tok.sent)
            print()
    return 0
nsw = no_sex_words(docs, no_sex, s_words)



A is For Arson: A Langley & Porter Mystery 	 suck 	 The overweight policeman was red faced and out of breath and he took a moment to double over, hands on his knees, and suck in a few breaths before he looked up to us.
A is For Arson: A Langley & Porter Mystery 	 suck 	 "I got into gambling again," he admitted and the air was sucked right out of the room.
A is For Arson: A Langley & Porter Mystery 	 kiss 	 I rose from the table then and kissed Liza, then father, on top of their heads.
A is For Arson: A Langley & Porter Mystery 	 kiss 	 I rolled my eyes and kissed my sister on the cheek before heading for the door.
A is For Arson: A Langley & Porter Mystery 	 kiss 	 I extended a hand to shake his but he placed a tender kiss upon mine instead.
A is For Arson: A Langley & Porter Mystery 	 kiss 	 He took my hand and kissed it for so long that Mr. Langley had to clear his throat before he would return his gaze to him though not before gazing up at me from beneath those thick lashes of his

Human Code 	 fuck 	 "The fuck is going on?
Human Code 	 fuck 	 "The fuck is going on?
Human Code 	 kiss 	 But it wasn't a sweet kiss; it was sad.
Human Code 	 kiss 	 But it wasn't a sweet kiss; it was sad.
Human Code 	 fuck 	 Get the fuck away from her!
Human Code 	 fuck 	 Get the fuck away from her!
Human Code 	 suck 	 I sucked in a breath, a hiss, and squeezed my eyes shut for only a second.
Human Code 	 fuck 	 Fuck me... shit.
Human Code 	 fuck 	 Fuck me... shit.
Human Code 	 kiss 	 He kissed my cheek.
Human Code 	 kiss 	 Kissed my cheek.
Human Code 	 kiss 	 be Kissed my cheek.
Human Code 	 suck 	 Mary's words slid over my eyes as I sucked in a breath.
Human Code 	 suck 	 Shaking my head, I tried to suck in a breath and calm myself.
Human Code 	 suck 	 I was worried that..." She sucked in a deep breath before giving me a weak smile.
Human Code 	 suck 	 He leaned forward, hands on his knees, sucking in deep, rough breaths as he tried to look at me.
Human Code 	 suck 	 I finally looke

**A is For Arson: A Langley & Porter Mystery** contains 11 occurrences of words from our list, but they do not describe anything sexual at all.

**A Twist of Marvel Infinity War** contains 39 occurrences of words from our list. This story contains kissing scenes, but again, nothing inappropriate at all.

**Charlie and Dia** contains 38 occurrences of words from our list. Again, the worst sentence reads: "Not knowing where his confidence came from, he kissed the top of her head, leading her to the sofa, as he allowed her to curl into him."

Although **Human Code** contains 41 occurrences of words from our list, they are used as swear words or in harmless phrases such as "sucked in a breath".

**Jackson Humes is not a Superhero** contains 19 occurrences of words from our list. The most explicit sentence reads: "I felt like an epic ass."

**The Omen Girl** has 21 occurrences of words from our list. The worst sentence reads: "He zips it up with her inside, and they kiss each other, and laugh."

Last, **Winners Don't Have Bad Days** has 60 occurrences of words from our vocabulary, which puts it 19th in our ranking. Again, nothing very serious is going on here.

### Conclusion?

Out of 30 stories, 23 contained the word 'sex'. We looked more closely at the 7 stories that did not contain it to see whether any other sex-related words would indicate inappropriate scenes even though sex is not mentioned explicitly. After inspecting them, we confirmed that these 7 stories do not contain anything inappropriate (based on our vocabulary).

But what about the rest?

Let's see the rankings once more.

- 1 A Fantasy Real `m`
- 2 A Bear in Sheep's Clothing | Book #1
- 3 T.M.I.
- 4 Bending the Rules `m`
- 5 The Mosaic in Her Eyes `m`
- 6 How To Be The Best Third Wheel ✔
- 7 RUNNING FROM THE PAST `m`
- 8 for June
- 9 How To Lose Weight And Survive The Apocalypse
- 10 The Psychopath Next Door `m`
- 11 The Devil's Match
- 12 Comfort the Wolves `m`
- 13 Inspector Rames
- 14 We The Young
- 15 A Timely Knight
- 16 Zombie Soap `m`
- 17 Dark Side of the Morning `m`
- 18 Breaking Darkness `m`
- 19 Winners Don't Have Bad Days
- 20 Human Code
- 21 A Twist Of Marvel Infinity War
- 22 Charlie and Dia
- 23 Parasomnia `m`
- 24 The Night the Vampires Came
- 25 Valeria Torres and the Midas Vault
- 26 Oliver Ausman Lives Again
- 27 THE OMEN GIRL
- 28 Jackson Humes is Not a Superhero
- 29 The Painted Altar
- 30 A is For Arson: A Langley & Porter Mystery

Out of 30 stories, 10 are marked as _mature_ on Wattpad (flagged with `m` in the list above). The situation is similar to that of the 7 stories not containing 'sex': the mature stories do not all rank in the first 10 positions. **A Fantasy Real**, labeled mature, "correctly" occupies first place in our list. Mature stories also take places 4, 5, 7, and 10, but the rest come later in the list. Notice **Parasomnia** in 23rd place.

On the other hand, **A Bear in Sheep's Clothing** ended up second. It is not labeled as a mature story, but it has 'gay', 'boyslove', and similar tags and contains sexual scenes. The same applies to **for June** (with different tags such as 'lovestory', 'romance', 'drama', and 'relationship-complicated') and to other stories.

The mature label does not necessarily mean that a story contains sexual scenes. For example, one of the mature-labeled stories, **Parasomnia**, ranked 23rd because it is not a romantic story. By contrast, **A Bear in Sheep's Clothing** made it to 2nd place because it contains explicit scenes even though it is not labeled as mature.
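The overlap between the mature label and the top of the ranking can be checked with a few lines. The sketch below uses the places of the `m`-flagged stories copied from the list above. The variable names are only illustrative.

```python
# Places (1-based) of the stories flagged `m` in the ranking above
mature_places = [1, 4, 5, 7, 10, 12, 16, 17, 18, 23]

# Mature-labeled stories that made it into the ten highest counts
in_top_ten = [place for place in mature_places if place <= 10]

print(len(in_top_ten), 'of', len(mature_places), 'mature stories are in the top 10')
```

So only half of the mature-labeled stories are among the ten stories with the highest counts.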

On the other hand, I am confused by **T.M.I.** ranking 3rd. It is supposed to be a mystery thriller. Let's explore it further:


In [194]:
for key, val in docs.items():
    if key == 'T.M.I.':
        for par in val:
            for tok in par:
                if tok.lemma_ in s_words:
                    print(key, tok.lemma_, '\t', tok.sent)

T.M.I. suck 	 "You sucked his dick!"
T.M.I. dick 	 "You sucked his dick!"
T.M.I. suck 	 "You sucked his dick!"
T.M.I. dick 	 "You sucked his dick!"
T.M.I. oral 	 Whether she'd really gone into the boys' locker room and provided oral services to the entire football team, like the rumor claimed, was irrelevant.
T.M.I. suck 	 I sucked Rod's... thing."
T.M.I. suck 	 I sucked Rod's... thing."
T.M.I. ass 	 An infatuated little girl, dependent on her family's prestige and on the ass-kissing of her friends .
T.M.I. fuck 	 That's why she probably fucked the fathers of all my enemies.
T.M.I. ass 	 Not my fault Prada makes ugly-ass hoodies."
T.M.I. fuck 	 " But she doesn't contradict me that designer hoodies are ugly as fuck. "
T.M.I. fuck 	 I open my mouth to end our interaction with my usual "Fuck off, mother," but I change my mind.
T.M.I. fuck 	 What the fuck is wrong with you?
T.M.I. fuck 	 What the fuck is wrong with you?
T.M.I. fuck 	 It's weird, illegal and scary as fuck.
T.M.I. fuck 	 Fuc

T.M.I. kiss 	 And just like that, he meets me halfway and kisses me.
T.M.I. kiss 	 His kisses move from my lips to my jawline and his mouth lingers next to my ear.
T.M.I. kiss 	 "Kiss me goodnight then."
T.M.I. kiss 	 Instead, he leans over me and plants a kiss on my lips.
T.M.I. kiss 	 At some point, I was certain he'd come to bed with me, touched me, kissed me some more.
T.M.I. kiss 	 As the words form in my mind, my body does a sudden lurch and I'm unpleasantly reminded of how much I kissed him last night and how he held me in his lap.
T.M.I. suck 	 That drug sucks.
T.M.I. kiss 	 And yet, I'm curious if his kisses feel like fire when I'm not stoned out of my mind.
T.M.I. kiss 	 I part my lips and let him kiss me.
T.M.I. kiss 	 Like the way he places one head behind my head and deepens the kiss, like the way our tongues wrestle for dominance.
T.M.I. fuck 	 Romantic as fuck.
T.M.I. kiss 	 My mind is still spinning as I try to do something with that kiss.
T.M.I. fuck 	 How he fucked th

As we can see, even mystery thrillers have obscene scenes. Now I am not surprised that this story ranked so high.

This very simple approach seems to work in our case. For more rigorous research questions, it would not be enough, and we would have to implement much more sophisticated algorithms than counting the occurrences of 33 words and basing our conclusions on that.

From what we have seen above, we can say that Wattpad can be a very unpleasant place for someone who enjoys top-quality literature. Although we have not inspected all stories in detail, we can say that the majority of the 30 stories that won the 2020 Wattys Award contain (at least) romantic scenes and explicit language.

I was surprised by these results. I expected the stories to be better than they seem to be, especially because they won the award.