|**ATTRIBUTE**          |**MEANING**                                                                                      |
|:----------------------|:------------------------------------------------------------------------------------------------|
|**`Title`**            | Title of the manga (romanized, with the English title in parentheses when available)            |
|**`Score`**            | Score on the MyAnimeList (MAL) site                                                             |
|**`Vote`**             | Number of readers who scored the manga                                                          |
|**`Ranked`**           | Rank of the manga on MAL                                                                        |
|**`Popularity`**       | Popularity rank of the manga on MAL                                                             |
|**`Members`**          | Number of readers who have this manga in their list                                             |
|**`Favorite`**         | Number of readers who marked this manga as a favorite                                           |
|**`Type`**             | Type (manga, manhwa, light novel, ...)                                                          |
|**`Volumes`**          | Number of volumes                                                                               |
|**`Chapters`**         | Number of chapters                                                                              |
|**`Status`**           | Publication status (ongoing, completed, on hiatus, ...)                                         |
|**`Published`**        | Publication period (start date to end date)                                                     |
|**`Genres`**           | Genres of the manga                                                                             |
|**`Themes`**           | Themes of the manga                                                                             |
|**`Demographics`**     | Target demographic (e.g., Shounen)                                                              |
|**`Serialization`**    | Magazine in which the manga is serialized (e.g., Shounen Jump)                                  |
|**`Author`**           | Author(s) of the manga                                                                          |
|**`Total Review`**     | Number of reviews written for the manga                                                         |
|**`Type Review`**      | Number of reviews in each category (Recommended / Mixed Feelings / Not Recommended)             |

In [1]:
!pip install requests-html

Collecting requests-html
  Downloading requests_html-0.10.0-py3-none-any.whl.metadata (15 kB)
Collecting pyquery (from requests-html)
  Downloading pyquery-2.0.1-py3-none-any.whl.metadata (9.0 kB)
Collecting fake-useragent (from requests-html)
  Downloading fake_useragent-1.5.1-py3-none-any.whl.metadata (15 kB)
Collecting parse (from requests-html)
  Downloading parse-1.20.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting bs4 (from requests-html)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting w3lib (from requests-html)
  Downloading w3lib-2.2.1-py3-none-any.whl.metadata (2.1 kB)
Collecting pyppeteer>=0.0.14 (from requests-html)
  Downloading pyppeteer-2.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting pyee<12.0.0,>=11.0.0 (from pyppeteer>=0.0.14->requests-html)
  Downloading pyee-11.1.1-py3-none-any.whl.metadata (2.8 kB)
Collecting websockets<11.0,>=10.0 (from pyppeteer>=0.0.14->requests-html)
  Downloading websockets-10.4-cp310-cp310-manylinux_2_5_x86_6

In [2]:
!pip install lxml_html_clean

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.4.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.4.1-py3-none-any.whl (14 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.4.1


In [4]:
import requests
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import re
import nest_asyncio
import pandas as pd 
import datetime
import time

### Crawl data URLs

In [5]:
nest_asyncio.apply() 
session = HTMLSession()

In [6]:
listUrl4 = []

for i in range(15000,20000,50):
    # URL of the ranking page to scrape (limit is the rank offset)
    url = f'https://myanimelist.net/topmanga.php?limit={i}'

    # Get the html content
    html = requests.get(url).text

    # Parse the html content
    soup = BeautifulSoup(html, "html.parser")

    # Get the list of manga
    listItem = soup.find_all("td", {"class": "title al va-t clearfix word-break"})

    # Get the url of each manga
    for item in listItem:
        listUrl4.append(item.find('a').get('href'))

    # Print the number of manga urls collected
    print(f'{len(listUrl4)} urls collected', end='\r', flush=True)

5000 urls collected
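
The offsets in the loop above step through the ranking in pages of 50 entries. As a minimal sketch of that arithmetic, the helper below (`build_page_urls` is a hypothetical name, not part of the notebook) builds the same list of page URLs without making any request, which makes it easy to check how many pages a range covers before crawling.

```python
def build_page_urls(start, stop, step=50):
    """Return the ranking-page URLs for rank offsets start, start+step, ... < stop."""
    base = "https://myanimelist.net/topmanga.php?limit={}"
    return [base.format(offset) for offset in range(start, stop, step)]


# Same range as the crawl above: offsets 15000, 15050, ..., 19950
page_urls = build_page_urls(15000, 20000)
print(len(page_urls), "pages")
print(page_urls[0])
print(page_urls[-1])
```

With 50 entries per page, these 100 pages account for the 5000 URLs reported above.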

In [7]:
listUrl = listUrl4
print(f'Total: {len(listUrl)} urls collected')

Total: 5000 urls collected


In [8]:
with open("/kaggle/working/link_collecting_4.txt", "w") as file:
    file.writelines(item + "\n" for item in listUrl)

### Crawl HTML content from the manga/light novel/... URLs

In [9]:
listHtml4 = []

for url in listUrl[0:5000]:
    res = session.get(url)
    while len(res.text) < 4000:
        # Response too short to be a full manga page; wait 200 seconds and retry
        time.sleep(200)
        res = session.get(url)
        
    listHtml4.append(res.text)

    # Print the number of manga html collected
    print(f'{len(listHtml4)}/{len(listUrl)} manga html collected', end='\r', flush=True)

5000/5000 manga html collected
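
The retry loop above waits indefinitely if a URL never returns a full page. One possible variant, sketched here under assumptions, caps the number of attempts and doubles the wait each time. The name `fetch_with_retry` and its parameters are hypothetical; the `fetch` and `sleep` callables are passed in so the logic can be tried without network access.

```python
import time


def fetch_with_retry(fetch, url, min_length=4000, max_attempts=5,
                     base_delay=200, sleep=time.sleep):
    """Call fetch(url) until the body has at least min_length characters.

    Waits base_delay * 2**attempt seconds between tries and returns None
    once max_attempts is reached, so one bad URL cannot stall the crawl.
    """
    for attempt in range(max_attempts):
        text = fetch(url)
        if len(text) >= min_length:
            return text
        # Do not sleep after the final failed attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

In the notebook this could be called as `fetch_with_retry(lambda u: session.get(u).text, url)`, skipping any URL for which it returns `None`. The 4000-character threshold and 200-second delay are carried over from the loop above, not tuned values.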

In [10]:
# Record the date of data collection for the project report
now = datetime.datetime.now()
now = now.strftime("%Y-%m-%d")
print("Time of data collection: ", now)

Time of data collection:  2024-11-16


In [12]:
listHtml = listHtml4
print(f'Total: {len(listHtml)} manga html collected')

Total: 5000 manga html collected


In [14]:
def extract_info(htmlComic):
    soup = BeautifulSoup(htmlComic, "html.parser")

    title = soup.find('span', {'itemprop': 'name'})
    if title is None:
        return None
    else:
        title_text = title.text.strip()
        title_english_span = title.find('span', {'class': 'title-english'})

        if title_english_span is not None:
            title_english_text = title_english_span.text.strip()
            title_text = title_text.replace(title_english_text, '')
            title = f'{title_text} ({title_english_text})'
        else:
            title = title_text
    # Score and vote count can be missing, so fall back to placeholders
    try:
        ratingValue = soup.find('span', {'itemprop': 'ratingValue'}).text
    except AttributeError:
        ratingValue = 'N/A'
    
    try:
        ratingCount = soup.find('span', {'itemprop': 'ratingCount'}).text
    except AttributeError:
        ratingCount = '-'
    ranked = re.findall(r'\d+', soup.find('span', {'class': 'numbers ranked'}).text)[0]
    popularity = re.findall(r'\d+', soup.find('span', {'class': 'numbers popularity'}).text)[0]

    volumes, chapters, status, published = '', '', '', ''
    genres, themes, authors, favorites, members = [], [], '', '', ''

    for space in soup.find_all("div", {'class': 'spaceit_pad'}):
        text = space.text
        if 'Volumes' in text:
            volumes = text.split(':')[1].strip()
        elif 'Chapters' in text:
            chapters = text.split(':')[1].strip()
        elif 'Status' in text:
            status = text.split(':')[1].strip()
        elif 'Published' in text:
            published = text.split(':')[1].strip()
        elif 'Genres' in text:
            genres = [gen.text for gen in space.find_all('span', {'itemprop': 'genre'})]
        elif 'Themes' in text:
            themes = [theme.text for theme in space.find_all('span', {'itemprop': 'genre'})]
        elif 'Authors' in text:
            authors = text.split(':')[1].strip()
        elif 'Favorites' in text:
            favorites = text.split(':')[1].strip()
        elif 'Members' in text:
            members = text.split(':')[1].strip()

    infoReviews = soup.find('div', {'class': 'manga-info-review__header mal-navbar'})
    totalReviews = re.findall(r'\d+', infoReviews.find('div', {'class': 'right'}).text)[0]

    typeReview = [
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'recommended'}).text)[0]),
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'mixed-feelings'}).text)[0]),
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'not-recommended'}).text)[0])
    ]

    return {
        "Title": title, "Score": ratingValue, "Vote": ratingCount,
        "Ranked": ranked, "Popularity": popularity, "Members": members,
        "Favorite": favorites, "Volumes": volumes, "Chapters": chapters,
        "Status": status, "Published": published, "Genres": genres,
        "Themes": themes, "Author": authors, "Total Review": totalReviews,
        "Type Review": typeReview
    }

# Parse each page once and skip pages without a title
data_list = [info for info in map(extract_info, listHtml) if info is not None]
df = pd.DataFrame(data_list)
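
In `extract_info`, each sidebar field is read with `text.split(':')[1]`, which keeps only the part between the first and second colon. A minimal sketch of a more defensive alternative is shown below; `parse_field` is a hypothetical helper, and the sample strings are illustrative rather than taken from the crawled pages.

```python
def parse_field(text):
    """Split 'Label: value' at the first colon only.

    Returns (label, value), or None when the text has no colon.
    """
    label, sep, value = text.partition(":")
    if not sep:
        return None
    return label.strip(), value.strip()


print(parse_field("Volumes: 2"))
print(parse_field("Published: Mar 19, 2016 to Oct 5, 2016"))
# A value that itself contains a colon is kept whole
print(parse_field("Authors: Example: Name (Story & Art)"))
```

Splitting on the first colon only matters when a value contains a colon of its own; for the fields shown in the output of this notebook both approaches give the same result.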

In [15]:
df.head()

Unnamed: 0,Title,Score,Vote,Ranked,Popularity,Members,Favorite,Volumes,Chapters,Status,Published,Genres,Themes,Author,Total Review,Type Review
0,Akuma ni Chic x Hack,6.59,292,15001,13845,1197,1,2,12,Finished,"Mar 19, 2016 to Oct 5, 2016","[Fantasy, Romance]",[],"Tanemura, Arina (Story & Art)",0,"[0, 0, 0]"
1,Manuke na FPS Player ga Isekai e Ochita Baai,6.59,2430,15002,2741,7921,22,Unknown,Unknown,Publishing,"Feb 9, 2016 to ?","[Action, Fantasy]",[],"Saiki, Junichi (Art), Jiraigen (Story)",4,"[4, 0, 0]"
2,Modokidomo,6.59,795,15003,9693,1934,0,2,20,Finished,"Jul 21, 2016 to Jun 15, 2017",[],[],"Aokawa, Nana (Story & Art)",1,"[1, 0, 0]"
3,Doukyuusei ni Koi wo Shita (My Bittersweet Crush),6.59,170,15004,19096,730,7,7,30,Finished,"Mar 3, 2016 to Apr 3, 2018","[Comedy, Romance]",[],"Miasa, Rin (Story & Art)",1,"[0, 0, 1]"
4,"Konna Amai Koto, Shiranai....",6.59,228,15005,19127,728,0,1,5,Finished,"Nov 11, 2016 to Jan 13, 2017",[],[],"Nishino, Kiina (Story & Art)",1,"[0, 0, 1]"
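
The `Published` column stores the period as a single string such as `Mar 19, 2016 to Oct 5, 2016`, with `?` for an unknown end date. As a rough sketch of how it could be split later, the hypothetical helpers below (`parse_mal_date`, `split_published`) assume the `Mon D, YYYY` format seen in the rows above and an English locale for month names; any other form, such as a bare year, is returned as `None`.

```python
from datetime import datetime


def parse_mal_date(text):
    """Parse a date like 'Mar 19, 2016'; return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%b %d, %Y").date()
    except ValueError:
        return None


def split_published(published):
    """Split 'start to end' into a (start_date, end_date) tuple."""
    start, _, end = published.partition(" to ")
    return parse_mal_date(start), parse_mal_date(end)


print(split_published("Mar 19, 2016 to Oct 5, 2016"))
print(split_published("Feb 9, 2016 to ?"))
```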


In [19]:
df.to_csv('/kaggle/working/raw_manga.csv', encoding='utf-8-sig', index=False)
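
Because `Genres`, `Themes` and `Type Review` hold Python lists, they are written to the CSV as strings and come back as strings when the file is read again. The sketch below shows one way to restore them with `ast.literal_eval`, using only the standard library and a small in-memory sample. The sample rows and the helper name `parse_list_cell` are assumptions for illustration, not the contents of `raw_manga.csv`.

```python
import ast
import csv
import io


def parse_list_cell(cell):
    """Turn a stringified list such as "['Fantasy', 'Romance']" back into a list."""
    return ast.literal_eval(cell) if cell else []


# Hypothetical sample in the shape a list column takes after being written to CSV
sample_csv = io.StringIO(
    'Title,Genres,Type Review\n'
    'Akuma ni Chic x Hack,"[\'Fantasy\', \'Romance\']","[0, 0, 0]"\n'
    'Modokidomo,[],"[1, 0, 0]"\n'
)

rows = list(csv.DictReader(sample_csv))
for row in rows:
    row["Genres"] = parse_list_cell(row["Genres"])
    row["Type Review"] = parse_list_cell(row["Type Review"])

print(rows[0]["Genres"], rows[0]["Type Review"])
```

With pandas, the same function can be passed through the `converters` argument of `pd.read_csv`, for example `converters={"Genres": parse_list_cell}`.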