| **ATTRIBUTES**            |**MEANING**               |
|:----------------------|:-------------------------------------------------------------|
|**`Title`**            | Title of the manga (romanized, with the English title in parentheses when available)            |
|**`Score`**            | Score on the MyAnimeList site (MAL)                                                             |
|**`Vote`**             | Number of readers voting for the manga                                                          |
|**`Ranked`**           | Ranking of manga on the web MyAnimeList (MAL)                                                   |
|**`Popularity`**       | Popularity rank of the manga on MAL                                                             |
|**`Members`**          | Number of readers who have this manga in their list                                             |
|**`Favorite`**         | Number of readers who love this manga                                                           |
|**`Types`**            | Type (manga / manhwa / light novel, ...)                                                        |
|**`Volumes`**          | Number of volumes of manga                                                                      |
|**`Chapters`**         | Number of chapters of manga                                                                     |
|**`Status`**           | Status of the manga (ongoing, completed, on hiatus,...)                                         |
|**`Published`**        | Publication period, from start date to end date (`?` if the end date is unknown)                |
|**`Genres`**           | Genres of manga                                                                                 |
|**`Themes`**           | The themes of the manga                                                                         |
|**`Demographic`**      | Target demographic (e.g., Shounen)                                                              |
|**`Serialization`**    | Magazine in which the manga is serialized (e.g., Shounen Jump)                                  |
|**`Author`**           | Author of manga                                                                                 |
|**`Total Review`**     | Total number of reviews left on the manga                                                       |
|**`Type Review`**      | Number of reviews in each category, as a list: [Recommended, Mixed Feelings, Not Recommended]   |

In [1]:
!pip install requests-html

Collecting requests-html
  Downloading requests_html-0.10.0-py3-none-any.whl.metadata (15 kB)
Collecting pyquery (from requests-html)
  Downloading pyquery-2.0.1-py3-none-any.whl.metadata (9.0 kB)
Collecting fake-useragent (from requests-html)
  Downloading fake_useragent-1.5.1-py3-none-any.whl.metadata (15 kB)
Collecting parse (from requests-html)
  Downloading parse-1.20.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting bs4 (from requests-html)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting w3lib (from requests-html)
  Downloading w3lib-2.2.1-py3-none-any.whl.metadata (2.1 kB)
Collecting pyppeteer>=0.0.14 (from requests-html)
  Downloading pyppeteer-2.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting pyee<12.0.0,>=11.0.0 (from pyppeteer>=0.0.14->requests-html)
  Downloading pyee-11.1.1-py3-none-any.whl.metadata (2.8 kB)
Collecting websockets<11.0,>=10.0 (from pyppeteer>=0.0.14->requests-html)
  Downloading websockets-10.4-cp310-cp310-manylinux_2_5_x86_6

In [2]:
!pip install lxml_html_clean

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.4.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.4.1-py3-none-any.whl (14 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.4.1


In [3]:
import requests
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import re
import nest_asyncio
import pandas as pd 
import datetime
import time

In [4]:
nest_asyncio.apply() # Allow an already-running event loop to accept nested event loops.
session = HTMLSession() # Session that can render JavaScript or handle a page's dynamic content (comments, ...)

### Crawl URLs

In [5]:
listUrl1 = []

for i in range(0, 5000, 50):
    # URL of the ranking page to scrape
    url = f'https://myanimelist.net/topmanga.php?limit={i}'

    # Get the html content
    html = requests.get(url).text

    # Parse the html content
    soup = BeautifulSoup(html, "html.parser")

    # Get the list of manga
    listItem = soup.find_all("td", {"class": "title al va-t clearfix word-break"})

    # Get the url of each manga
    for item in listItem:
        listUrl1.append(item.find('a').get('href'))

    # Print the number of manga urls collected
    print(f'{len(listUrl1)} urls collected', end='\r', flush=True)

5000 urls collected
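
The loop above relies on the `limit` query parameter acting as an offset into the ranking: 100 requests yielded 5000 URLs, so each page lists 50 entries. As a minimal sketch of that pagination step on its own, the helper below builds the page URLs without sending any request. The name `top_manga_page_urls` and its parameters are hypothetical and only serve the illustration.

```python
def top_manga_page_urls(total=5000, page_size=50):
    """Build the ranking-page URLs needed to cover `total` entries.

    Hypothetical helper: `total` and `page_size` mirror the values used
    in range(0, 5000, 50) in the cell above.
    """
    base = "https://myanimelist.net/topmanga.php"
    return [f"{base}?limit={offset}" for offset in range(0, total, page_size)]


urls = top_manga_page_urls()
print(len(urls))  # 100
print(urls[0])    # https://myanimelist.net/topmanga.php?limit=0
print(urls[-1])   # https://myanimelist.net/topmanga.php?limit=4950
```

Building the URL list first makes it easy to check how many requests a crawl will send before starting it.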

### Concatenate list URLs

In [6]:
listUrl = listUrl1
print(f'Total: {len(listUrl)} urls collected')

Total: 5000 urls collected


In [7]:
with open("/kaggle/working/link_collecting_1.txt", "w") as file:
    file.writelines(item + "\n" for item in listUrl1)

<a class="anchor" id="collect_data"></a>

## <span style='color:#2B9C15 '> 📕 Collect data of each manga  </span>
1. For each URL collected above, send a GET request to fetch the HTML content of the page.
2. If the HTML content is shorter than 4000 characters, the website has most likely blocked the connection. Wait for a while (200 seconds in the code below) and send the GET request again.
3. Save the HTML content in a list for parsing later.

The crawl is split into batches of 5000 pages each, so that the website does not interrupt the connection because of too many requests. This notebook collects one such batch.
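
The retry rule in the steps above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the notebook's exact code: `fetch_with_retry`, `get_text`, and `max_retries` are hypothetical names, and the retry cap is an addition (the loop in the cell below retries without limit). The fetcher is passed in as a callable so the logic can be tried offline.

```python
import time


def fetch_with_retry(get_text, url, min_length=4000, wait_seconds=200,
                     max_retries=5, sleep=time.sleep):
    """Fetch `url`, retrying while the response looks like a block page.

    get_text: callable that takes a URL and returns the body as a str,
              for example `lambda u: session.get(u).text`.
    Returns the body, or None if every attempt came back too short.
    `max_retries` is an assumption added here to avoid waiting forever.
    """
    for attempt in range(max_retries + 1):
        text = get_text(url)
        if len(text) >= min_length:
            return text
        if attempt < max_retries:
            sleep(wait_seconds)  # back off before asking again
    return None


# Offline demonstration with a fake fetcher: two short replies, then a full page.
replies = iter(["blocked", "blocked", "x" * 5000])
waits = []
page = fetch_with_retry(lambda u: next(replies),
                        "https://example.invalid/manga/1",
                        sleep=waits.append)
print(len(page), waits)  # 5000 [200, 200]
```

Returning `None` after the last attempt lets the caller log the failed URL and move on instead of stalling the whole batch on one page.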

### 👉 Crawl HTML content from the 5000 manga/light novel/... URLs

In [8]:
listHtml1 = []

for url in listUrl[0:5000]:
    res = session.get(url)
    while len(res.text) < 4000:
        # Probably blocked: sleep for 200 seconds before retrying
        time.sleep(200)
        res = session.get(url)
        
    listHtml1.append(res.text)

    # Print the number of manga html collected
    print(f'{len(listHtml1)}/{len(listUrl)} manga html collected', end='\r', flush=True)

5000/5000 manga html collected

In [9]:
# Extract time of data collection to report for the project
now = datetime.datetime.now()
now = now.strftime("%Y-%m-%d")
print("Time of data collection: ", now)

Time of data collection:  2024-11-17


In [10]:
listHtml = listHtml1
print(f'Total: {len(listHtml)} manga html collected')

Total: 5000 manga html collected


### 👉 Extract the detailed values from each manga page

In [21]:
def extract_info(htmlComic):
    soup = BeautifulSoup(htmlComic, "html.parser")

    title = soup.find('span', {'itemprop': 'name'})
    if title is None:
        return None
    else:
        title_text = title.text.strip()
        title_english_span = title.find('span', {'class': 'title-english'})

        if title_english_span is not None:
            title_english_text = title_english_span.text.strip()
            title_text = title_text.replace(title_english_text, '')
            title = f'{title_text} ({title_english_text})'
        else:
            title = title_text
    
    ratingValue = soup.find('span', {'itemprop': 'ratingValue'}).text
    ratingCount = soup.find('span', {'itemprop': 'ratingCount'}).text
    ranked = re.findall(r'\d+', soup.find('span', {'class': 'numbers ranked'}).text)[0]
    popularity = re.findall(r'\d+', soup.find('span', {'class': 'numbers popularity'}).text)[0]

    volumes, chapters, status, published = '', '', '', ''
    genres, themes, authors, favorites, members = [], [], '', '', ''
    type_, demographic, serialization = '', '', ''

    for space in soup.find_all("div", {'class': 'spaceit_pad'}):
        text = space.text.strip()
        
        if 'Type:' in text:
            type_ = text.split(':', 1)[1].strip()
        elif 'Volumes:' in text:
            volumes = text.split(':', 1)[1].strip()
        elif 'Chapters:' in text:
            chapters = text.split(':', 1)[1].strip()
        elif 'Status:' in text:
            # Take the text that follows the <span class="dark_text"> tag
            status = space.find('span', {'class': 'dark_text'}).next_sibling.strip()
        elif 'Published:' in text:
            published = text.split(':', 1)[1].strip()
        elif 'Genres:' in text or 'Genre:' in text:
            genres = [gen.text.strip() for gen in space.find_all('a')]
        elif 'Themes:' in text or 'Theme:' in text:
            # Collect the theme names from the <a> tags
            themes = [theme.text.strip() for theme in space.find_all('a')]
        elif 'Demographic:' in text or 'Demographics:' in text:
            demographic = space.find('a').text.strip()
        elif 'Serialization:' in text or 'Serializations:' in text:
            serialization_tag = space.find('a')  # Look for the <a> tag
            serialization = serialization_tag.text.strip() if serialization_tag else ''  # Empty string if no link is found
        elif 'Authors:' in text or 'Author:' in text:
            # Split only on the first colon, as for the other fields
            authors = text.split(':', 1)[1].strip()
        elif 'Favorites:' in text:
            favorites = text.split(':', 1)[1].strip()
        elif 'Members:' in text:
            members = text.split(':', 1)[1].strip()

    infoReviews = soup.find('div', {'class': 'manga-info-review__header mal-navbar'})
    totalReviews = re.findall(r'\d+', infoReviews.find('div', {'class': 'right'}).text)[0]

    typeReview = [
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'recommended'}).text)[0]),
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'mixed-feelings'}).text)[0]),
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'not-recommended'}).text)[0])
    ]

    return {
        "Title": title, "Score": ratingValue, "Vote": ratingCount,
        "Ranked": ranked, "Popularity": popularity, "Members": members,
        "Favorite": favorites, "Types": type_, "Volumes": volumes, 
        "Chapters": chapters, "Status": status, "Published": published, 
        "Genres": genres, "Themes": themes, "Demographic": demographic, "Serialization": serialization, 
        "Author": authors, "Total Review": totalReviews, "Type Review": typeReview
    }

# Parse every page, skipping pages without a title (extract_info returns None)
data_list = []
for idx, htmlComic in enumerate(listHtml, start=1):
    result = extract_info(htmlComic)
    if result is not None:
        data_list.append(result)
    # Print progress after each page
    print(f"Processed {idx}/{len(listHtml)} items.", end='\r', flush=True)

df = pd.DataFrame(data_list)

Processed 5000/5000 items.
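
`extract_info` reads several numbers with `re.findall(r'\d+', ...)[0]`, which raises `IndexError` when the text contains no digits, and `soup.find(...)` returns `None` when an element is missing. The run above completed on all 5000 pages, so this did not occur here, but a more defensive variant could look like the sketch below. `first_int` is a hypothetical helper, not part of the notebook, and the input strings are made up for the demonstration.

```python
import re


def first_int(text, default=None):
    """Return the first run of digits in `text` as an int, or `default`.

    Like the regular expression in extract_info, this stops at the first
    non-digit, so a thousands separator would cut the number short.
    """
    if text is None:
        return default
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


print(first_int("#23"))             # 23
print(first_int("no digits here"))  # None
print(first_int(None, default=0))   # 0
```

With such a helper, a page lacking one field would produce a missing value in that column rather than aborting the whole parsing loop.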

In [22]:
df.head()

Unnamed: 0,Title,Score,Vote,Ranked,Popularity,Members,Favorite,Types,Volumes,Chapters,Status,Published,Genres,Themes,Demographic,Serialization,Author,Total Review,Type Review
0,Berserk,9.47,363720,1,1,725079,130489,Manga,Unknown,Unknown,Publishing,"Aug 25, 1989 to ?","[Action, Adventure, Award Winning, Drama, Fant...","[Gore, Military, Mythology, Psychological]",Seinen,Young Animal,"Miura, Kentarou (Story & Art), Studio Gaga (Art)",289,"[252, 17, 20]"
1,JoJo no Kimyou na Bouken Part 7: Steel Ball Ru...,9.31,172219,2,23,280428,46269,Manga,24,96,Finished,"Jan 19, 2004 to Apr 19, 2011","[Action, Adventure, Mystery, Supernatural]",[Historical],Seinen,Ultra Jump,"Araki, Hirohiko (Story & Art)",131,"[123, 7, 1]"
2,Vagabond,9.26,154583,3,13,406082,44258,Manga,37,327,On Hiatus,"Sep 3, 1998 to May 21, 2015","[Action, Adventure, Award Winning]","[Historical, Samurai]",Seinen,Morning,"Inoue, Takehiko (Story & Art), Yoshikawa, Eiji...",104,"[93, 9, 2]"
3,One Piece,9.22,392811,4,4,642620,119974,Manga,Unknown,Unknown,Publishing,"Jul 22, 1997 to ?","[Action, Adventure, Fantasy]",[],Shounen,Shounen Jump (Weekly),"Oda, Eiichiro (Story & Art)",231,"[190, 21, 20]"
4,Monster,9.16,104327,5,29,258581,22008,Manga,18,162,Finished,"Dec 5, 1994 to Dec 20, 2001","[Award Winning, Drama, Mystery]","[Adult Cast, Psychological]",Seinen,Big Comic Original,"Urasawa, Naoki (Story & Art)",86,"[69, 11, 6]"


In [25]:
df.to_csv('/kaggle/working/raw_manga_1.csv', encoding='utf-8-sig', index=False)
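
One thing to keep in mind when reloading this file: `Genres`, `Themes`, and `Type Review` hold Python lists, and a CSV file can only store them as text. The sketch below reproduces that round trip with the standard library only, using the `Type Review` value shown for Berserk above, and parses the text back into a list with `ast.literal_eval`. It illustrates the idea and is not a step from the original notebook.

```python
import ast
import csv
import io

# Write one row with a list-valued field; in a CSV file it becomes plain text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "Type Review"])
writer.writerow(["Berserk", str([252, 17, 20])])

# Read it back: the list arrives as a string and must be parsed explicitly.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["Type Review"]
parsed = ast.literal_eval(raw)

print(type(raw).__name__, raw)        # str [252, 17, 20]
print(type(parsed).__name__, parsed)  # list [252, 17, 20]
```

After loading the file with pandas, the same conversion can be applied to a column, for example `df['Type Review'].apply(ast.literal_eval)`.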