# Project: Wrangling Netflix Actor Rating Data

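## Analysis Goal

The goal of this analysis is to compute, for each genre of film and television (comedy, action, sci-fi, and so on), the average IMDB score of the titles each actor has appeared in, and from that to identify the actors associated with highly rated titles in each genre.

This hands-on project is an exercise in data wrangling: the aim is to produce a dataset that is ready for the next stage of analysis.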
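
As a rough sketch of where this cleanup is heading, the snippet below computes the mean IMDB score per genre and actor on a tiny hand-made dataset. The rows and the names `toy_titles`, `toy_credits`, and `genre_actor_mean` are illustrative assumptions, not the project's data or necessarily the steps the project itself takes.

```python
import pandas as pd

# Toy stand-ins for the titles and credits tables (made-up rows, illustration only).
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": [["comedy"], ["comedy", "drama"], ["drama"]],
    "imdb_score": [6.0, 8.0, 7.0],
})
toy_credits = pd.DataFrame({
    "id": ["t1", "t2", "t2", "t3"],
    "name": ["Actor A", "Actor A", "Actor B", "Actor B"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "ACTOR"],
})

# One row per (title, genre), then one row per (title, genre, actor).
per_genre = toy_titles.explode("genres")
actors = toy_credits[toy_credits["role"] == "ACTOR"]
merged = per_genre.merge(actors, on="id", how="inner")

# Mean score of the titles each actor appeared in, within each genre.
genre_actor_mean = (
    merged.groupby(["genres", "name"])["imdb_score"].mean().reset_index()
)
print(genre_actor_mean)
```

The real tables need cleaning before a computation like this is meaningful, which is what the rest of the notebook works toward.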

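## Introduction

The raw dataset records all Netflix shows and movies available in the United States as of July 2022. It consists of two tables: `titles.csv` and `credits.csv`.

`titles.csv` holds information about each movie and show, including its ID, title, type, description, genres, and IMDB score (IMDB is an online movie database and rating site), among other fields. `credits.csv` holds more than 70,000 credit entries for the directors and actors who appear in Netflix titles, including name, title ID, character name, and role (director or actor).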

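The columns of `titles.csv` are:
- id: ID of the title.
- title: Name of the title.
- type: Kind of title, either a show or a movie.
- description: Short description.
- release_year: Year of release.
- age_certification: Age rating.
- runtime: Length of one episode (for shows) or of the movie.
- genres: List of genres.
- production_countries: List of producing countries.
- seasons: Number of seasons, if the title is a show.
- imdb_id: ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Number of votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.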

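The columns of `credits.csv` are:
- person_id: ID of the cast or crew member.
- id: ID of the title they took part in.
- name: Person's name.
- character: Name of the character played.
- role: Kind of credit, either actor or director.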
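
The two column lists share the `id` column, so each credit row can be looked up against exactly one title. A minimal sketch of that relationship, using made-up rows and hypothetical names (`toy_titles`, `toy_credits`, `joined`) rather than the real files:

```python
import pandas as pd

# Made-up rows that mimic the two schemas described above.
toy_titles = pd.DataFrame({
    "id": ["tm1", "ts2"],
    "title": ["Some Movie", "Some Show"],
    "type": ["MOVIE", "SHOW"],
})
toy_credits = pd.DataFrame({
    "person_id": [10, 11, 10],
    "id": ["tm1", "tm1", "ts2"],
    "name": ["Person X", "Person Y", "Person X"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Each credit row points at one title, so the join is many-to-one;
# validate= makes pandas raise if the titles side ever has a duplicated id.
joined = toy_credits.merge(toy_titles, on="id", how="left", validate="many_to_one")
print(joined[["name", "role", "title"]])
```

Using `validate="many_to_one"` is a design choice for safety: it turns a silent row multiplication into an explicit error if the title IDs are not unique.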

In [1]:
import pandas as pd

In [2]:
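# Load the raw tables; these stay untouched as a reference copy.
original_titles = pd.read_csv('10_titles.csv')
original_credits = pd.read_csv('10_credits.csv')

In [3]:
# Work on copies so the cleaning steps never modify the raw data.
cleand_titles = original_titles.copy()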
cleand_credits = original_credits.copy()

In [4]:
cleand_titles.head()

,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6


In [5]:
cleand_titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5850 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5850 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5850 non-null   object 
 3   description           5832 non-null   object 
 4   release_year          5850 non-null   int64  
 5   age_certification     3231 non-null   object 
 6   runtime               5850 non-null   int64  
 7   genres                5850 non-null   object 
 8   production_countries  5850 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5447 non-null   object 
 11  imdb_score            5368 non-null   float64
 12  imdb_votes            5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
dtypes: float64(5), int64(2), object(8)

In [6]:
cleand_titles[cleand_titles['imdb_score'].isnull()]

,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.600,
75,tm132164,Bill Hicks: Sane Man,MOVIE,Sane Man was filmed before Bill recorded ‘Dang...,1989,R,80,['comedy'],['US'],,,,,3.377,7.5
145,ts251477,My First Errand,SHOW,“Hajimete no Otsukai” (First Errand) is a Japa...,1991,TV-G,18,"['documentation', 'family', 'reality']",['JP'],12.0,,,,7.730,7.8
177,ts224786,Escalona,SHOW,"The improbable real life of Rafael Escalona, w...",1991,TV-MA,44,['drama'],['CO'],1.0,,,,7.352,7.6
180,tm8687,Sam Kinison: Family Entertainment Hour,MOVIE,Get ready to scream with laughter! Sam Kinison...,1991,,49,['comedy'],['US'],,,,,1.840,6.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5810,tm1225897,Social Man,MOVIE,Two competitive social media Influencers go he...,2021,,96,"['comedy', 'drama']",[],,tt20198164,,,,
5833,ts307884,HQ Barbers,SHOW,When a family run barber shop in the heart of ...,2021,TV-14,24,['comedy'],['NG'],1.0,,,,0.840,
5840,tm1216735,Sun of the Soil,MOVIE,"In 14th-century Mali, an ambitious young royal...",2022,,26,[],[],,,,,1.179,7.0
5844,tm1074617,Bling Empire - The Afterparty,MOVIE,"The stars of ""Bling Empire"" discuss the show's...",2021,,35,[],['US'],,,,,,


In [7]:
# Titles without an IMDB score cannot contribute to an average, so drop them.
cleand_titles.dropna(subset=['imdb_score'], inplace=True)

In [8]:
cleand_titles['imdb_score'].isnull().sum()

0
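
The cell above drops rows in place. As a small illustration of what `subset` does, the sketch below applies the same call without `inplace` to a made-up frame (the name `toy` and its rows are assumptions for illustration): only a missing `imdb_score` removes a row, while a missing value in another column does not.

```python
import pandas as pd

# Made-up rows: one missing imdb_score, one missing tmdb_score.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "imdb_score": [7.1, None, 5.4],
    "tmdb_score": [None, 6.0, 5.0],
})

# Returns a new frame; rows are removed only when imdb_score is missing.
kept = toy.dropna(subset=["imdb_score"])
print(kept)
print(len(toy), len(kept))
```

Returning a new frame instead of mutating leaves the input available for comparison, which can be handy when checking how many rows a cleaning step removed.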

In [9]:
cleand_titles['id'].duplicated().sum()

0
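
When reading a duplicate count like the one above, it helps to remember that `duplicated()` flags only the repeats after the first occurrence by default. A short sketch on a made-up Series (the name `ids` is hypothetical):

```python
import pandas as pd

# Made-up IDs where "tm1" appears twice.
ids = pd.Series(["tm1", "tm2", "tm1", "tm3"])

# Default keep="first": only the later occurrence is marked as a duplicate.
later_repeats = ids.duplicated().sum()

# keep=False: every row that shares its value with another row is marked.
all_involved = ids.duplicated(keep=False).sum()

print(later_repeats, all_involved)
```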

In [10]:
# Preview a naive comma split of the genres strings (displayed only, nothing is assigned).
cleand_titles['genres'].str.split(',', expand=True)

,0,1,2,3,4,5,6,7,8,9
1,['drama','crime'],,,,,,,,
2,['drama','action','thriller','european'],,,,,,
3,['fantasy','action','comedy'],,,,,,,
4,['war','action'],,,,,,,,
5,['comedy','european'],,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
5843,['romance','comedy','drama'],,,,,,,
5845,['romance','drama'],,,,,,,,
5846,['drama'],,,,,,,,,
5847,['comedy'],,,,,,,,,
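
The output above shows that `genres` is stored as text that merely looks like a Python list, so a plain comma split leaves brackets and quotes attached to the pieces. One possible alternative, sketched here on made-up rows (the frame `toy` is an assumption, and this is not necessarily the approach the project takes next), is to parse each string with `ast.literal_eval` and then use `explode`:

```python
import ast
import pandas as pd

# Made-up rows with genres stored as strings, as in the real table.
toy = pd.DataFrame({
    "id": ["tm1", "tm2", "tm3"],
    "genres": ["['drama', 'crime']", "['comedy']", "[]"],
})

# Turn the text "['drama', 'crime']" into a real Python list.
toy["genres"] = toy["genres"].apply(ast.literal_eval)

# One row per (title, genre); a title with an empty list yields a NaN genre row.
exploded = toy.explode("genres")
print(exploded)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing data read from a file.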


In [11]:
cleand_titles['genres'].value_counts()

genres
['comedy']                                                       436
['drama']                                                        272
['documentation']                                                261
['comedy', 'drama']                                              123
['drama', 'romance']                                             121
                                                                ... 
['drama', 'war', 'action', 'thriller', 'history', 'european']      1
['thriller', 'crime', 'drama', 'western']                          1
['drama', 'scifi', 'fantasy', 'horror']                            1
['horror', 'fantasy', 'thriller']                                  1
['documentation', 'music', 'reality']                              1
Name: count, Length: 1695, dtype: int64
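
Because the column holds strings, the counts above are per exact combination and ordering of genres, not per individual genre. The following sketch, on a made-up Series named `genres` (an assumption for illustration), contrasts the two ways of counting:

```python
import ast
import pandas as pd

# Made-up values: the first two contain the same genres in a different order.
genres = pd.Series(["['comedy', 'drama']", "['drama', 'comedy']", "['comedy']"])

# As raw strings, every distinct spelling of the list is its own category.
per_combination = genres.value_counts()
print(per_combination)

# Parsed and exploded, the counts are per individual genre.
per_genre = genres.apply(ast.literal_eval).explode().value_counts()
print(per_genre)
```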

In [12]:
cleand_titles.describe()

,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,5368.0,5368.0,1939.0,5368.0,5352.0,5293.0,5145.0
mean,2016.298994,78.562407,2.232078,6.510861,23439.38,23.471931,6.823868
std,7.005642,38.711435,2.769952,1.163826,95820.47,84.058311,1.145812
min,1954.0,0.0,1.0,1.5,5.0,0.6,1.0
25%,2016.0,45.0,1.0,5.8,516.75,2.939,6.1
50%,2018.0,86.0,1.0,6.6,2233.5,7.175,6.9
75%,2020.0,105.0,2.5,7.3,9494.0,17.211,7.5
max,2022.0,225.0,42.0,9.6,2294231.0,2274.044,10.0


In [13]:
cleand_credits.head()

,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR


In [14]:
cleand_credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB
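
The summary above shows that `character` is the only column with missing values, while `role` is always present. A natural follow-up check, sketched on made-up rows (the frame `toy_credits` is hypothetical, and the idea that missing character names come mostly from director rows is a guess to verify, not a finding), is to look at the role distribution and at missing characters among actors only:

```python
import pandas as pd

# Made-up credits: the director row has no character name.
toy_credits = pd.DataFrame({
    "person_id": [1, 2, 3],
    "id": ["tm1", "tm1", "tm2"],
    "name": ["Person X", "Person Y", "Person Z"],
    "character": ["Hero", None, "Villain"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# How many rows of each role there are.
print(toy_credits["role"].value_counts())

# Keep actor rows only, then count missing character names among them.
actors_only = toy_credits[toy_credits["role"] == "ACTOR"]
missing_characters = actors_only["character"].isnull().sum()
print(missing_characters)
```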
