# Part 1: Web Scraping
## Web Scraping
### Task: Scrap the following information from IMDb Top 250 movies:

#### 1. Name
#### 2. Rating
#### 3. Number of votes
#### 4. Release year
#### 5. Country/Region
#### 6. Genre

## Demo of per movie's data structure,
Grab from HTML

In [None]:
<div class="item">
    <div class="pic">
        <em class="">1</em>
        <a href="https://movie.douban.com/subject/1292052/">
            <img alt="肖申克的救赎" class="" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" width="100" />
        </a>
    </div>
    <div class="info">
        <div class="hd">
            <a class="" href="https://movie.douban.com/subject/1292052/">
                <span class="title">肖申克的救赎</span>
                <span class="title"> / The Shawshank Redemption</span>
                <span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
            </a>
            <span class="playable">[可播放]</span>
        </div>
        <div class="bd">
            <p class="">
                导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /...<br />
                1994 / 美国 / 犯罪 剧情
            </p>
            <div class="star">
                <span class="rating5-t"></span>
                <span class="rating_num" property="v:average">9.7</span>
                <span content="10.0" property="v:best"></span>
                <span>3111801人评价</span>
            </div>
            <p class="quote">
                <span class="inq">希望让人自由。</span>
            </p>
        </div>
    </div>
</div>

In [17]:
# Write by suqiulin 
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time, re

def get_page_movies(url, headers):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Get movie list
    pages_items = soup.find_all('div', class_='item')
    # Parse data...
    return pages_items

def extract_from_per_movie(per_movie):
    def deal_special_element():
        # 获取电影导演、演员、年份、上映地区等信息
        special_info = per_movie.find('div', class_='bd').find('p').text.strip()
        # 这条数据包含了很多信息，需要使用正则拆分开
        pattern = re.compile(r"导演: (.*?)\s+主演: (.*?)\s+(\d{4})\s+/\s+(.*?)\s+/\s+(.*)")
        match = re.search(pattern, special_info)
        special_dict = {}
        if match:
            director = match.group(1).strip()
            actors = match.group(2).strip()
            year = match.group(3).strip()
            countries = match.group(4).strip().split(' ')
            genres = match.group(5).strip().split(' ')
            special_dict = {'director': director, 'actors': actors, 'release_year': year, 'country': countries, 'genre': genres}
        return special_dict
    # 获取排名
    rank = per_movie.find('em').text.strip()
    # 获取电影标题
    title = per_movie.find('span', class_='title').text.strip()
    # 获取评分信息
    rating_num = per_movie.find('span', class_='rating_num').text.strip()
    # 获取评价人数信息
    rate_people_num = per_movie.find('div', class_='star').find_all('span')[3].text.strip()
    need_data = {'rank': rank, 'name': title, 'rating': rating_num, 'votes': rate_people_num}
    
    special_dict = deal_special_element(per_movie)
    need_data.update(special_dict)
    print(need_data)
    return need_data

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'}
    base_url = 'https://movie.douban.com/top250?start={start}'
    base_number, page_size = 0, 25
    url_list = []
    # 每个电影包含的6个特征，按排名排序
    ranked_movie_list = []
    for index in range(0, 10):
        url_list.append(base_url.format(start=base_number))
        base_number += page_size

    for url in url_list:
        page_movies = get_page_movies(url, headers=headers)
        for per_movie in page_movies:
            per_need_data = extract_from_per_movie(per_movie)
            ranked_movie_list.append(per_need_data)

TypeError: extract_from_per_movie.<locals>.deal_special_element() takes 0 positional arguments but 1 was given

In [None]:
from bs4 import BeautifulSoup
import requests, time, re
from random import randint
import pandas as pd

url_list = ['https://movie.douban.com/top250']
base_url = 'https://movie.douban.com/top250?start={start}'
for start in range(25, 251, 25):
    url_list.append(base_url.format(start=start))

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0'}
movie_info = []
details = []

for url in url_list:
    time.sleep(randint(1, 3))
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    pages_items = soup.find_all('div', class_='item')
    for movie in pages_items:
        # 获取排名
        rank = movie.find('em').text.strip()
        # 获取电影标题
        title = movie.find('span', class_='title').text.strip()
        # 获取电影导演、演员、年份、上映地区等信息
        info = movie.find('div', class_='bd').find('p').text.strip()
        # 由于info这条数据包含了很多信息，需要使用正则拆分开
        #print(info)
        pattern = re.compile(r"导演: (.*?)\s+主演: (.*?)\s+(\d{4})\s+/\s+(.*?)\s+/\s+(.*)")
        match = re.search(pattern, info)
        if match:
            director = match.group(1).strip()
            actors = match.group(2).strip()
            year = match.group(3).strip()
            countries = match.group(4).strip().split(' ')
            genres = match.group(5).strip().split(' ')

        # 获取评分信息
        rating_num = movie.find('span', class_='rating_num').text.strip()
        # 获取评价人数信息
        rate_people_num = movie.find('div', class_='star').find_all('span')[3].text.strip()
        # 将信息进行汇总
        need_data = {'排名': rank, '电影名称': title, '导演': director, '演员': actors, '上映年份': year, '上映地区': countries, '电影类型': genres,'评分': rating_num,
                     '投票人数': rate_people_num}
        print(need_data)
        movie_info.append(need_data)

df = pd.DataFrame(movie_info,columns=['排名', '电影名称', '导演', '演员', '上映年份', '上映地区', '电影类型', '评分', '投票人数'])
excel_path = 'movie_info.xlsx'
df.to_excel(excel_path, index=False)



{'排名': '1', '电影名称': '肖申克的救赎', '导演': '弗兰克·德拉邦特 Frank Darabont', '演员': '蒂姆·罗宾斯 Tim Robbins /...', '上映年份': '1994', '上映地区': ['美国'], '电影类型': ['犯罪', '剧情'], '评分': '9.7', '投票人数': '3111934人评价'}
{'排名': '2', '电影名称': '霸王别姬', '导演': '陈凯歌 Kaige Chen', '演员': '张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...', '上映年份': '1993', '上映地区': ['中国大陆', '中国香港'], '电影类型': ['剧情', '爱情', '同性'], '评分': '9.6', '投票人数': '2296516人评价'}
{'排名': '3', '电影名称': '阿甘正传', '导演': '罗伯特·泽米吉斯 Robert Zemeckis', '演员': '汤姆·汉克斯 Tom Hanks / ...', '上映年份': '1994', '上映地区': ['美国'], '电影类型': ['剧情', '爱情'], '评分': '9.5', '投票人数': '2315487人评价'}
{'排名': '4', '电影名称': '泰坦尼克号', '导演': '詹姆斯·卡梅隆 James Cameron', '演员': '莱昂纳多·迪卡普里奥 Leonardo...', '上映年份': '1997', '上映地区': ['美国', '墨西哥'], '电影类型': ['剧情', '爱情', '灾难'], '评分': '9.5', '投票人数': '2357147人评价'}
{'排名': '5', '电影名称': '千与千寻', '导演': '宫崎骏 Hayao Miyazaki', '演员': '柊瑠美 Rumi Hîragi / 入野自由 Miy...', '上映年份': '2001', '上映地区': ['日本'], '电影类型': ['剧情', '动画', '奇幻'], '评分': '9.4', '投票人数': '2404662人评价'}
{'排名': '6', '电影名称': '美丽人生', '导演': '罗伯托·贝尼尼 R

ModuleNotFoundError: No module named 'openpyxl'

# Part 2: Basic Data Analysis Tasks

## Data Cleaning and Statistics
### Task: Calculate the average rating of all movies and find the top 5 highest-rated movies.

In [None]:
# Hint code
df = pd.read_csv('imdb_movies.csv')
# Calculate mean rating
mean_rating = ...
# Find top 5 movies
top_5 = ...

## Decade Analysis
### Task: Group movies by decades (1990s, 2000s, 2010s, etc.) and calculate the number of movies and average rating for each decade.


In [None]:
# Hint code
df['decade'] = (df['year'] // 10) * 10
decade_stats = ...

## Genre Analysis
### Task: Count the number of movies in each genre (note that a movie can have multiple genres).

In [None]:
# Hint code
# Split genre string
genres = df['genre'].str.split(',')
# Count each genre
genre_counts = ...

## Country Analysis
### Task: Find the top three countries with the largest number of movies and calculate their average ratings.

In [None]:
# Hint code
country_stats = df.groupby('country').agg({
    'title': 'count',
    'rating': 'mean'
})

## Correlation Analysis
### Task: Analyze the correlation between movie ratings and number of votes, and create a scatter plot.

In [None]:
# Hint code
import matplotlib.pyplot as plt
correlation = df['rating'].corr(df['votes'])
plt.scatter(...)

## I encourage to attempt each task independently before referring to the hint code. Each task can be extended further, such as adding more detailed analysis or improving visualizations.