# Webtoons: What makes a Webtoon Series Popular in 2022?
## Data Science Tutorial
Mei Lu
CMSC320

#### Background
Webtoons are a type of digital comic that originated in South Korea and are usually meant to be read on computers and smartphones. They typically share a few common traits: each episode is published as one long vertical strip (an infinite canvas rather than multiple pages, which makes it easier to read on a smartphone or computer); some feature music and animations that play during each chapter; and, unlike the majority of East Asian comics, they are usually in color rather than black-and-white, since they are published digitally on a website or app rather than physically in a magazine.

[Source](https://en.wikipedia.org/wiki/Webtoon)

#### Motivation

I have been a fan of comics and Webtoons since I was a kid. With the explosion of Webtoons on an international scale, they have become so popular that some have received TV adaptations. I would like to find out what elements make a webtoon series successful enough to amass millions of readers. Is there a special formula? Is it pure luck? Have the elements of popularity changed over the years? Let's find out what makes a webtoon series popular in current times by analyzing some data scraped from the [Webtoons](https://webtoons.com) site.

### Part 1: Data Collection

Let's begin by loading the data and taking care of the required imports!

The dataset contains over 700 unique webtoons with 19 columns of general information, including genres, authors, views, and likes.

[Dataset Source](https://www.kaggle.com/datasets/victorsoeiro/webtoons-dataset)

I chose this data source because it contains premium webtoons, which are mostly the ones that readers can pay to read ahead in. Since anyone can publish a webtoon on the site, this filters out the webtoons with little or no readership, so there is less tidying of the data for us to do. The dataset is also well formatted, with few to no inconsistencies.

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [44]:
# Import the data and peek at the beginning of the table

dataset = pd.read_csv('data.csv')
dataset.head()

Unnamed: 0,webtoon_id,title,genre,thumbnail,summary,episodes,Created by,view,subscribe,grade,released_date,url,cover,likes,Written by,Art by,Adapted by,Original work by,Assisted by
0,1218,Let's Play,Romance,https://webtoon-phinf.pstatic.net/20210629_103...,"She’s young, single and about to achieve her d...",171,Leeanne M. Krecic (Mongie),606.5M,4.6M,9.59,"Nov 6, 2017",https://www.webtoons.com/en/romance/letsplay/l...,https://webtoon-phinf.pstatic.net/20210629_163...,37.2M,,,,,
1,1436,True Beauty,Romance,https://webtoon-phinf.pstatic.net/20210129_175...,"After binge-watching beauty videos online, a s...",197,Yaongyi,874M,7M,9.53,"Aug 15, 2018",https://www.webtoons.com/en/romance/truebeauty...,https://webtoon-phinf.pstatic.net/20210129_65/...,46.2M,,,,,
2,2135,The Remarried Empress,Fantasy,https://webtoon-phinf.pstatic.net/20200904_29/...,Navier Ellie Trovi was an empress perfect in e...,110,,231.3M,2.5M,9.87,"Sep 5, 2020",https://www.webtoons.com/en/fantasy/the-remarr...,https://webtoon-phinf.pstatic.net/20200904_268...,21.2M,Alphatart,Sumpul,HereLee,,
3,1798,Midnight Poppy Land,Romance,https://webtoon-phinf.pstatic.net/20191119_132...,After making a grisly discovery in the country...,99,Lilydusk,198.8M,2.3M,9.8,"Nov 22, 2019",https://www.webtoons.com/en/romance/midnight-p...,https://webtoon-phinf.pstatic.net/20191119_163...,13.5M,,,,,
4,3416,Reunion,Romance,https://webtoon-phinf.pstatic.net/20220311_196...,"After moving away for a decade, Rhea returns t...",9,stephattyy,7.1M,629872,9.77,"Mar 17, 2022",https://www.webtoons.com/en/romance/reunion/li...,https://webtoon-phinf.pstatic.net/20220311_14/...,570151,,,,,


#### Formatting

#### Check for Duplicates
First, let's check for duplicate entries in this dataset based on the series names in the table. 

In [46]:
# Check if all titles are unique
dataset['title'].is_unique # This is true!

True

#### Handling and Checking of NaN Data
Missing values in the 'Written by', 'Art by', 'Adapted by', 'Original work by', and 'Assisted by' columns are not a problem, since those credits do not apply to every series. We do, however, need to check for NaN values in the title, genre, summary, episodes, view, subscribe, grade, likes, and released_date columns, since we will use them in our visualizations and analysis. If any of these important columns contain NaN data, we'll need to come up with a way to handle it.

In [47]:
#Check for NaN Values
print('Check for NaN values:')
print(dataset['released_date'].isnull().values.any())
print(dataset['title'].isnull().values.any())
print(dataset['genre'].isnull().values.any())
print(dataset['summary'].isnull().values.any())
print(dataset['episodes'].isnull().values.any())
print(dataset['view'].isnull().values.any())
print(dataset['subscribe'].isnull().values.any())
print(dataset['grade'].isnull().values.any())
print(dataset['likes'].isnull().values.any())

Check for NaN values:
False
False
False
False
False
False
False
False
False
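
The same check can be written more compactly by counting missing values per column in one call. The sketch below is only an illustration: `sample` is a hypothetical stand-in for `dataset`, built from three rows of the preview above with one title deliberately blanked out so that the check has something to find.

```python
import pandas as pd

# Hypothetical stand-in for `dataset`: three preview rows, with one title
# deliberately set to None so the check below reports a missing value.
sample = pd.DataFrame({
    'title': ["Let's Play", None, 'Reunion'],
    'genre': ['Romance', 'Romance', 'Romance'],
    'grade': [9.59, 9.53, 9.77],
})

required_columns = ['title', 'genre', 'grade']

# isna().sum() counts the missing values in each column
missing_counts = sample[required_columns].isna().sum()
print(missing_counts)

# Select the rows that are missing at least one required value
rows_with_missing = sample[sample[required_columns].isna().any(axis=1)]
print(rows_with_missing)
```

On the real data, swapping `sample` for `dataset` and extending `required_columns` to the full list should give a per-column count instead of a column of unlabeled booleans.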


#### Add a column for published year
Let's add a column for the year each webtoon series was published. This will come in handy later when we compare metrics by year, since indicators such as likes and subscribers accumulate over time and older webtoons have had longer to collect them. It would be unfair to compare all the webtoons together without taking this into account.

In [48]:
# Add a column for publishing year
print('Preview of released_year column:')
dataset['released_year'] = pd.DatetimeIndex(dataset['released_date']).year
dataset['released_year'].head()

Preview of released_year column:


0    2017
1    2018
2    2020
3    2019
4    2022
Name: released_year, dtype: int64
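
If stricter parsing is wanted, the date format can be spelled out explicitly so that an unexpected date string raises an error instead of being guessed at. This is a minimal sketch using three dates copied from the preview above; the format string `'%b %d, %Y'` is an assumption based on how those dates look, and it assumes English month abbreviations.

```python
import pandas as pd

# Dates copied from the released_date column in the preview above
released = pd.Series(['Nov 6, 2017', 'Aug 15, 2018', 'Sep 5, 2020'])

# '%b %d, %Y' = abbreviated month name, day of month, four-digit year
parsed = pd.to_datetime(released, format='%b %d, %Y')

# The .dt accessor exposes datetime fields such as the year
released_year = parsed.dt.year
print(released_year.tolist())
```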

#### Number Formatting
Like and subscriber counts are sometimes abbreviated, with millions written using the suffix 'M', which is not helpful when we want to do calculations with them. Let's convert all of these text representations in the table into numerical values!

We must also convert these columns from string to numerical types.

In [50]:
# We can make a function since we have to apply this to both the subscribe column and the likes column

print("Before Conversion:")
print(dataset['subscribe'].head())
print(dataset['likes'].head())

def value_to_float(x):
    # Remove thousands separators such as the comma in '629,872'
    x = str(x).replace(',', '').strip()
    # Scale by the suffix: K = thousand, M = million, B = billion
    multipliers = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}
    if x and x[-1] in multipliers:
        # A bare suffix such as 'M' is treated as one unit (1,000,000)
        return float(x[:-1] or 1) * multipliers[x[-1]]
    # No suffix: the value is already a plain number
    return float(x)

# apply() returns floats, so both columns become float64 without a separate to_numeric step
dataset['subscribe'] = dataset['subscribe'].apply(value_to_float)
dataset['likes'] = dataset['likes'].apply(value_to_float)

print("After Conversion:")
print(dataset['subscribe'].head())
print(dataset['likes'].head())

Before Conversion:
0       4.6M
1         7M
2       2.5M
3       2.3M
4    629,872
Name: subscribe, dtype: object
0      37.2M
1      46.2M
2      21.2M
3      13.5M
4    570,151
Name: likes, dtype: object

### Part 2: Data Management / Representation

Now that we have verified values and tidied our data, let's do some visualizations to see if we find anything interesting!

- Median Subscribers by Published Year
- Correlation Between Likes and Subscribers
- Correlation Between Grade and Subscribers
- Number of Series by the Top 10 Webtoon Authors
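
As a small preview of the correlation items listed above, here is a sketch that computes Pearson correlations on just the five rows shown in the data preview, with the 'M' abbreviations expanded by hand. The `preview` frame is a stand-in for the full `dataset`, and five rows are far too few to draw conclusions from; the point is only to show the mechanics.

```python
import pandas as pd

# The five rows from the data preview, with 'M' abbreviations expanded by hand
preview = pd.DataFrame({
    'title': ["Let's Play", 'True Beauty', 'The Remarried Empress',
              'Midnight Poppy Land', 'Reunion'],
    'subscribe': [4_600_000, 7_000_000, 2_500_000, 2_300_000, 629_872],
    'likes': [37_200_000, 46_200_000, 21_200_000, 13_500_000, 570_151],
    'grade': [9.59, 9.53, 9.87, 9.80, 9.77],
})

# Series.corr defaults to Pearson correlation, which ranges from -1 to +1
likes_vs_subs = preview['likes'].corr(preview['subscribe'])
grade_vs_subs = preview['grade'].corr(preview['subscribe'])

print(f"likes vs. subscribers: {likes_vs_subs:.2f}")
print(f"grade vs. subscribers: {grade_vs_subs:.2f}")
```

A scatter plot of the same two columns (for example with `plt.scatter`) is the natural visual companion to each of these numbers.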

#### Why Focus on Subscribership?
Most of the visualizations listed above deal with subscriber count. When a reader subscribes to a story, it indicates some sort of investment in it. Subscriptions matter more than views: the website and app can choose to promote a webtoon and drive views to it, but that is not an indicator of lasting popularity.

Subscribers accumulate over time, so this metric could be biased toward older webtoons. To mitigate this, I decided to also break much of the data down by year, to get a more accurate picture of which webtoons are popular both of all time and recently. Comparing webtoons published in the same year also gives a fairer comparison of their subscriber counts. I chose the median because it is less affected by skew than the mean, which accounts for outlier series that explode in popularity compared to the other webtoons.
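
To see why the median is the safer summary here, consider this small sketch. It uses the subscriber counts from the five preview rows and then adds one hypothetical breakout series; the 50,000,000 figure is made up purely for illustration.

```python
import numpy as np

# Subscriber counts from the five rows in the data preview
subscribers = np.array([4_600_000, 7_000_000, 2_500_000, 2_300_000, 629_872])

# Hypothetical breakout hit (the 50M figure is invented for illustration)
with_outlier = np.append(subscribers, 50_000_000)

# How far each summary statistic moves when the outlier is added
mean_shift = with_outlier.mean() - subscribers.mean()
median_shift = np.median(with_outlier) - np.median(subscribers)

print("mean  :", subscribers.mean(), "->", with_outlier.mean())
print("median:", np.median(subscribers), "->", np.median(with_outlier))
```

The single outlier pulls the mean much further than the median, which is the behavior we want to avoid when comparing typical series across years.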

In [42]:
# Median Subscribers by Published Year
# Assumes 'subscribe' is numeric (see the Number Formatting step above)
median_subs = dataset.groupby('released_year')['subscribe'].median()
print(median_subs)

# One bar per release year
plt.bar(median_subs.index, median_subs.values)
plt.xlabel('Released year')
plt.ylabel('Median subscribers')
plt.title('Median Subscribers by Published Year')
plt.show()

### Part 3: Exploratory Data Analysis


##### Does Genre Matter?

##### Does Rating Matter?

##### Does Number of Episodes Matter?

##### Does Release Date Matter?


In [None]:
# Top 100 Subscribed Webtoons of All Time
# Assumes 'subscribe' is numeric; nlargest raises a TypeError on object (string) columns
top_100_sub_all = dataset.nlargest(100, 'subscribe')
top_100_sub_all

### Part 4: Hypothesis Testing

Linear Regression
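
One possible starting point, sketched under assumptions: fit an ordinary least-squares line with likes as the predictor and subscribers as the response. The choice of variables is mine for illustration, and the data is again just the five preview rows, so the fitted numbers mean little on their own. A proper hypothesis test would also need a p-value, which something like `scipy.stats.linregress` can provide.

```python
import numpy as np

# Preview rows: likes as the predictor, subscribers as the response
likes = np.array([37_200_000, 46_200_000, 21_200_000, 13_500_000, 570_151], dtype=float)
subscribers = np.array([4_600_000, 7_000_000, 2_500_000, 2_300_000, 629_872], dtype=float)

# A degree-1 polyfit is a least-squares line: subscribers ~ slope * likes + intercept
slope, intercept = np.polyfit(likes, subscribers, deg=1)

# R^2: share of the variance in subscribers explained by the fitted line
predicted = slope * likes + intercept
ss_res = np.sum((subscribers - predicted) ** 2)
ss_tot = np.sum((subscribers - subscribers.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.4f}, intercept={intercept:.0f}, R^2={r_squared:.3f}")
```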

### Part 5: Insights

#### Pros and Cons of Approach

#### Resources for further learning