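# Project: Organizing Netflix Movie Actor Rating Data

## Analysis Goal

The goal of this analysis is to compute, for each genre of film and television (for example comedy, action, or sci-fi), the average IMDB score of the titles each actor has appeared in, and so identify the actors associated with highly rated titles in each genre.

This hands-on project is an exercise in organizing data so that it is ready for the next stage of analysis.

## Introduction

The original dataset records all Netflix TV shows and movies available in the United States as of July 2022. It consists of two tables: `titles.csv` and `credits.csv`.

`titles.csv` holds information about movies and TV shows, including the title ID, title, type, description, genres, IMDB score (IMDB is an online movie rating site), and so on. `credits.csv` holds information about more than 70,000 directors and actors who appear in Netflix titles, including name, title ID, character name, and credit type (director or actor).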

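The columns of `titles.csv` are:
- id: title ID.
- title: title of the work.
- type: kind of work, TV show (`SHOW`) or movie (`MOVIE`).
- description: short description.
- release_year: release year.
- age_certification: age rating.
- runtime: length of the movie, or of each episode for a TV show.
- genres: list of genres.
- production_countries: list of production countries.
- seasons: number of seasons, if the work is a TV show.
- imdb_id: IMDB ID.
- imdb_score: IMDB score.
- imdb_votes: number of IMDB votes.
- tmdb_popularity: TMDB popularity.
- tmdb_score: TMDB score.

The columns of `credits.csv` are:
- person_id: cast or crew member ID.
- id: ID of the title the person took part in.
- name: name.
- character: character name.
- role: credit type, actor or director.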

In [1]:
import pandas as pd

## Importing the Data

In [2]:
# Load the data to be processed
original_titles = pd.read_csv("/Users/hardy/Desktop/Python_file/Python_DataAnalyst/Data_Organize/Project_Organize/titles.csv")
original_credits = pd.read_csv("/Users/hardy/Desktop/Python_file/Python_DataAnalyst/Data_Organize/Project_Organize/credits.csv")

In [3]:
original_titles

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.010,7.300
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5845,tm1014599,Fine Wine,MOVIE,A beautiful love story that can happen between...,2021,,100,"['romance', 'drama']",['NG'],,tt13857480,6.8,45.0,1.466,
5846,tm898842,C/O Kaadhal,MOVIE,A heart warming film that explores the concept...,2021,,134,['drama'],[],,tt11803618,7.7,348.0,,
5847,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021,,90,['comedy'],['CO'],,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarrassing Me - The Afterparty,MOVIE,"Jamie Foxx, David Alan Grier and more from the...",2021,PG-13,37,[],['US'],,,,,1.296,10.000


In [4]:
original_credits

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


## Cleaning the Data

### Structural issues: splitting list-valued columns

### Splitting the genres column

In [5]:
# Take the first 5 rows of original_titles
original_titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6


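The sample above shows that `original_titles` has a structural problem: the `genres` and `production_countries` columns hold lists, which is inconvenient for later analysis. The `explode` method should therefore be used to split each list into separate rows.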

In [6]:
# Check the format of the genres column
original_titles["genres"][1]

"['drama', 'crime']"

The elements of the `genres` column turn out to be strings rather than lists, so `eval` is used to convert each string into a list.

In [7]:
# Use eval inside a lambda to convert the genres values into lists
original_titles["genres"] = original_titles["genres"].apply(lambda x: eval(x))
original_titles["genres"][1]

['drama', 'crime']
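
`eval` executes any Python expression it is given, which is risky when the strings come from a file you do not control. A minimal sketch of a safer alternative, assuming every value is a Python-style list literal as in the sample above, uses `ast.literal_eval` from the standard library. The small Series below is made-up data for illustration, not the real dataset.

```python
import ast

import pandas as pd

# Hypothetical toy data mimicking the string-encoded lists in the genres column
genres_as_text = pd.Series(["['drama', 'crime']", "['comedy']", "[]"])

# literal_eval only parses Python literals and never executes code
genres_as_lists = genres_as_text.apply(ast.literal_eval)

print(genres_as_lists[0])        # ['drama', 'crime']
print(type(genres_as_lists[0]))  # <class 'list'>
```

The call pattern is the same as with `eval`, so it can be dropped into the `apply` step without other changes.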

The elements of `genres` are now lists, so `explode` can be used to split them into separate rows.

In [8]:
# Split the lists in the genres column into separate rows
original_titles = original_titles.explode("genres")
original_titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,documentation,['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,drama,['US'],,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,crime,['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,drama,['US'],,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,action,['US'],,tt0068473,7.7,107673.0,10.01,7.3
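
To make the effect of `explode` concrete, here is a small sketch on made-up data (the titles and genres are hypothetical). Two details are worth noticing: the original index label is repeated for every new row, and an empty list becomes a missing value, which may explain some of the missing `genres` entries that appear later in the cleaning steps.

```python
import pandas as pd

# Hypothetical toy data: title "C" has an empty genre list
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["drama", "crime"], ["comedy"], []],
})

exploded = toy.explode("genres")
print(exploded)
# Index labels repeat (0, 0, 1, 2); the empty list for "C" becomes NaN

# Optional: give every row a unique label again
exploded_unique = exploded.reset_index(drop=True)
```

Because labels repeat after `explode`, indexing such as `exploded["genres"][0]` returns every row carrying label 0 rather than a single value.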


The same procedure applies to the `production_countries` column.

### Splitting the production_countries column

In [9]:
# Check the format of the production_countries column
original_titles["production_countries"][0]

"['US']"

In [10]:
# Convert and split the lists in production_countries as well
original_titles["production_countries"] = original_titles["production_countries"].apply(lambda x: eval(x))
original_titles = original_titles.explode("production_countries")
original_titles.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,documentation,US,1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,drama,US,,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,action,US,,tt0068473,7.7,107673.0,10.01,7.3


In [11]:
# Take the first 5 rows of original_credits
original_credits.head()

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR


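This shows that `original_credits` has no structural problems, so no processing is needed.

### Content issues: data type conversion, missing values, duplicates, inconsistent and invalid data

### Data type conversion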

In [12]:
# Inspect the basic information of original_titles
original_titles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17818 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    17818 non-null  object 
 1   title                 17817 non-null  object 
 2   type                  17818 non-null  object 
 3   description           17790 non-null  object 
 4   release_year          17818 non-null  int64  
 5   age_certification     10889 non-null  object 
 6   runtime               17818 non-null  int64  
 7   genres                17755 non-null  object 
 8   production_countries  17439 non-null  object 
 9   seasons               6224 non-null   float64
 10  imdb_id               17116 non-null  object 
 11  imdb_score            16976 non-null  float64
 12  imdb_votes            16945 non-null  float64
 13  tmdb_popularity       17663 non-null  float64
 14  tmdb_score            17241 non-null  float64
dtypes: float64(5), int64(2), 

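The output shows that `original_titles` has 17818 observations, and that `title`, `description`, `age_certification`, `genres`, `production_countries`, `seasons`, `imdb_id`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` all contain missing values, which will be assessed and cleaned later.

In addition, `release_year` represents a year, so its type should be a date rather than a number, and it needs to be converted.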

In [13]:
# Convert release_year to a date
original_titles["release_year"] = pd.to_datetime(original_titles["release_year"], format='%Y')
original_titles["release_year"]

0      1945-01-01
1      1976-01-01
1      1976-01-01
2      1972-01-01
2      1972-01-01
          ...    
5847   2021-01-01
5848   2021-01-01
5849   2021-01-01
5849   2021-01-01
5849   2021-01-01
Name: release_year, Length: 17818, dtype: datetime64[ns]
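
As a small illustration of what this conversion does, the sketch below applies the same `pd.to_datetime` call to a few made-up year values. Each year becomes 1 January of that year, and the plain year can still be recovered through the `.dt` accessor when needed.

```python
import pandas as pd

years = pd.Series([1945, 1976, 2021])  # hypothetical year values

# format="%Y" tells pandas to read each value as a four-digit year
as_dates = pd.to_datetime(years, format="%Y")
print(as_dates)

# The year is still available through the .dt accessor
print(as_dates.dt.year.tolist())  # [1945, 1976, 2021]
```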

In [14]:
# Inspect the basic information of original_credits
original_credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB


The output shows `person_id` stored as an integer. Since it is an identifier rather than a quantity, it should be converted to a string.

In [15]:
# Convert person_id to a string
original_credits["person_id"] = original_credits["person_id"].astype("str")
original_credits["person_id"]

0           3748
1          14658
2           7064
3           3739
4          48933
          ...   
77796     736339
77797     399499
77798     373198
77799     378132
77800    1950416
Name: person_id, Length: 77801, dtype: object

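### Handling missing data

From the analysis above, the `title`, `description`, `age_certification`, `genres`, `production_countries`, `seasons`, `imdb_id`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` columns of `original_titles` contain missing values.

However, the goal of this data processing is to find the actors in highly rated titles in each genre, so missing values in `genres` and `imdb_score` would affect the later analysis and need to be assessed and handled first.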

In [16]:
# View the rows where imdb_score is missing
original_titles[original_titles["imdb_score"].isnull()]

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945-01-01,TV-MA,51,documentation,US,1.0,,,,0.600,
75,tm132164,Bill Hicks: Sane Man,MOVIE,Sane Man was filmed before Bill recorded ‘Dang...,1989-01-01,R,80,comedy,US,,,,,3.377,7.5
145,ts251477,My First Errand,SHOW,“Hajimete no Otsukai” (First Errand) is a Japa...,1991-01-01,TV-G,18,documentation,JP,12.0,,,,7.730,7.8
145,ts251477,My First Errand,SHOW,“Hajimete no Otsukai” (First Errand) is a Japa...,1991-01-01,TV-G,18,family,JP,12.0,,,,7.730,7.8
145,ts251477,My First Errand,SHOW,“Hajimete no Otsukai” (First Errand) is a Japa...,1991-01-01,TV-G,18,reality,JP,12.0,,,,7.730,7.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5810,tm1225897,Social Man,MOVIE,Two competitive social media Influencers go he...,2021-01-01,,96,drama,,,tt20198164,,,,
5833,ts307884,HQ Barbers,SHOW,When a family run barber shop in the heart of ...,2021-01-01,TV-14,24,comedy,NG,1.0,,,,0.840,
5840,tm1216735,Sun of the Soil,MOVIE,"In 14th-century Mali, an ambitious young royal...",2022-01-01,,26,,,,,,,1.179,7.0
5844,tm1074617,Bling Empire - The Afterparty,MOVIE,"The stars of ""Bling Empire"" discuss the show's...",2021-01-01,,35,,US,,,,,,


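Because these observations lack `imdb_score`, the core variable for this analysis, we delete them and then check how many missing values remain in that column.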

In [17]:
# Drop the rows where imdb_score is missing
original_titles = original_titles.dropna(subset="imdb_score")
original_titles["imdb_score"].isnull().sum()

0
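
The `subset` argument is what keeps this deletion targeted. A minimal sketch on made-up data shows that only rows missing the named column are removed, while missing values in other columns are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "imdb_score": [8.2, np.nan, 7.7],
    "tmdb_score": [np.nan, 6.0, 7.3],
})

# Only rows missing imdb_score are removed; the NaN in tmdb_score is kept
kept = toy.dropna(subset=["imdb_score"])
print(kept)
```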

In [18]:
# View the rows where genres is missing
original_titles[original_titles["genres"].isnull()]

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
1813,ts77824,My Next Guest Needs No Introduction With David...,SHOW,TV legend David Letterman teams up with fascin...,2018-01-01,TV-MA,50,,US,4.0,tt7829834,7.8,5581.0,8.217,7.6
1939,ts215037,Minecraft: Story Mode,SHOW,"MInecraft: Story Mode is an interactive, anima...",2018-01-01,TV-PG,52,,US,1.0,tt10498322,5.6,347.0,,
2386,ts74805,A Little Help with Carol Burnett,SHOW,In this unscripted series starring comedy lege...,2018-01-01,TV-G,24,,US,1.0,tt7204366,6.3,237.0,1.621,6.2
2658,ts265844,#ABtalks,SHOW,#ABtalks is a YouTube interview show hosted by...,2018-01-01,TV-PG,68,,,1.0,tt12635254,9.6,7.0,,
4274,tm1172010,The Lockdown Plan,MOVIE,,2020-01-01,,49,,,,tt13079112,6.5,,,
4648,tm1113921,In Vitro,MOVIE,'In Vitro' is an otherworldly rumination on me...,2019-01-01,,27,,,,tt10545994,7.7,,,


In [19]:
# Drop the rows where genres is missing
original_titles = original_titles.dropna(subset="genres")
original_titles["genres"].isnull().sum()

0

### Handling duplicate data

Duplicate rows in `original_titles` and `original_credits` would distort the later analysis, so they should be assessed and handled.

In [20]:
# Count duplicate rows in original_titles
original_titles.duplicated().sum()

0

In [21]:
# Count duplicate rows in original_credits
original_credits.duplicated().sum()

0

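The output above shows that neither `original_titles` nor `original_credits` contains duplicate rows, so no processing is needed.

### Handling inconsistent data

Because this analysis is about finding the actors in highly rated titles within each genre, inconsistent values in `genres` would have a large impact, so that column is assessed first.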
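
For reference, `duplicated()` with no arguments flags a row only when every column matches an earlier row. The sketch below, on made-up credit rows, shows what a non-zero count looks like and how `drop_duplicates` would remove it.

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first exactly
toy = pd.DataFrame({
    "person_id": ["1", "1", "2"],
    "id": ["tm1", "tm1", "tm1"],
    "character": ["Tom", "Tom", "Betsy"],
})

# duplicated() flags a row only if every column matches an earlier row
print(toy.duplicated().sum())  # 1

deduplicated = toy.drop_duplicates()
print(len(deduplicated))  # 2
```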

In [22]:
# Assess inconsistent values in genres of original_titles
original_titles["genres"].value_counts()

genres
drama            3357
comedy           2419
thriller         1446
action           1339
romance          1080
crime            1066
documentation     981
family            769
animation         732
fantasy           727
european          679
scifi             647
horror            438
history           336
music             266
reality           226
war               221
sport             188
western            53
Name: count, dtype: int64

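The output shows no inconsistent values in `genres`: each value refers to a distinct genre, so no processing is needed.

`production_countries` may also contain inconsistent values and should be assessed.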

In [23]:
# Assess inconsistent values in production_countries of original_titles
original_titles["production_countries"].value_counts()

production_countries
US    5648
IN    1610
GB    1068
JP    1046
FR     720
      ... 
GT       1
CU       1
LK       1
NP       1
FO       1
Name: count, Length: 108, dtype: int64

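Because the result of `value_counts` has too many values, Pandas by default shows only some at the start and the end. To display the full result, set `display.max_rows` to `None`, which removes the row limit.

Since the full result is only needed for this one call to `value_counts`, we can use `option_context` to change the limit temporarily.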

In [24]:
with pd.option_context('display.max_rows', None):
    print(original_titles['production_countries'].value_counts())

production_countries
US         5648
IN         1610
GB         1068
JP         1046
FR          720
KR          637
ES          637
CA          608
DE          383
CN          295
MX          264
IT          224
BR          221
AU          217
TR          195
PH          192
AR          150
ID          149
BE          148
TW          133
NG          131
PL          126
ZA          103
NL          102
HK          102
CO           94
EG           93
DK           89
TH           87
SE           81
LB           70
NO           68
AE           52
IE           49
SG           47
XX           43
IL           42
RU           41
CL           35
CH           33
PS           32
BG           31
MY           30
SA           28
AT           28
IS           28
LU           27
NZ           27
PE           26
RO           25
QA           24
CZ           22
JO           19
FI           18
HU           18
UY           15
MA           15
PT           14
KH           10
KW           10
PR            9
PK 

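The output shows that production countries are written as two-letter country codes, with one exception: the value `Lebanon`.

The country code for Lebanon is `LB`, which appears 70 times in the output above, so the data is inconsistent here. `LB` and `Lebanon` refer to the same country and need to be unified.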

In [25]:
# Replace Lebanon with LB in production_countries, then count the remaining Lebanon values
original_titles["production_countries"] = original_titles["production_countries"].replace("Lebanon", "LB")
print(len(original_titles[original_titles["production_countries"] == "Lebanon"]))

0
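
Reading through a long `value_counts` listing by eye is error-prone. A minimal sketch of a programmatic check, assuming that every valid entry is a two-character code, filters on string length. The Series below is made-up data.

```python
import pandas as pd

# Hypothetical toy data with one spelled-out country name and one missing value
countries = pd.Series(["US", "LB", "Lebanon", "GB", None, "LB"])

# Flag any non-missing value whose length is not exactly two characters
odd_values = countries[countries.str.len() != 2].dropna().unique().tolist()
print(odd_values)  # ['Lebanon']

# Unify the spelling, mirroring the replace step used above
cleaned = countries.replace("Lebanon", "LB")
```

This check would not catch a wrong but well-formed code, so it complements rather than replaces a look at the value counts.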


For `original_credits`, inconsistent data could appear in `role`, so we check whether several different values refer to the same credit type.

In [26]:
# Check the role column of original_credits for inconsistent values
original_credits['role'].value_counts()

role
ACTOR       73251
DIRECTOR     4550
Name: count, dtype: int64

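The output shows that `role` takes only two values, `ACTOR` or `DIRECTOR`, so there is no inconsistent data.

We can convert this column to the Category type, which uses less memory than strings and also makes it explicit that the column takes only a limited set of values.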

In [27]:
original_credits["role"] = original_credits["role"].astype("category")
original_credits["role"]

0           ACTOR
1           ACTOR
2           ACTOR
3           ACTOR
4           ACTOR
           ...   
77796       ACTOR
77797       ACTOR
77798       ACTOR
77799       ACTOR
77800    DIRECTOR
Name: role, Length: 77801, dtype: category
Categories (2, object): ['ACTOR', 'DIRECTOR']
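
The memory claim can be checked directly. The sketch below builds a made-up column of 10,000 role labels and compares its footprint before and after the conversion using `memory_usage(deep=True)`; the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical toy column with only two distinct labels
roles = pd.Series(["ACTOR"] * 9_000 + ["DIRECTOR"] * 1_000)
roles_cat = roles.astype("category")

# deep=True counts the string contents, not just the pointers to them
bytes_as_text = roles.memory_usage(deep=True)
bytes_as_category = roles_cat.memory_usage(deep=True)
print(bytes_as_text, bytes_as_category)
```

The saving comes from storing each distinct label once and keeping only a small integer code per row.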

### Handling incorrect or invalid data

In [28]:
# Compute descriptive statistics for original_titles
original_titles.describe()

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,16970,16970.0,5954.0,16970.0,16941.0,16842.0,16515.0
mean,2015-11-14 22:42:51.974072064,80.912552,2.455492,6.514207,32816.55,29.396307,6.846933
min,1954-01-01 00:00:00,0.0,1.0,1.5,5.0,0.6,1.0
25%,2015-01-01 00:00:00,45.0,1.0,5.8,780.0,4.07,6.2
50%,2018-01-01 00:00:00,90.0,2.0,6.6,3508.0,10.195,6.9
75%,2020-01-01 00:00:00,107.0,3.0,7.3,16978.0,23.639,7.5
max,2022-01-01 00:00:00,225.0,42.0,9.5,2294231.0,2274.044,10.0
std,,39.596172,2.869428,1.131095,114149.2,93.178235,1.078831


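These statistics show no values in `original_titles` that fall outside a realistic range for the variables used in this analysis. The minimum `runtime` of 0 looks suspicious, but `runtime` is not used here.

`original_credits` contains no variables with numeric meaning, so there is no need to check it with `describe`.

### Saving the cleaned data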

In [29]:
# Save the cleaned data
original_titles.to_csv("cleaned_titles.csv", index=False)
original_credits.to_csv("cleaned_credits.csv", index=False)

## Organizing the Data

In [30]:
# Load the cleaned data
clean_titles = pd.read_csv("/Users/hardy/Desktop/Python_file/Python_DataAnalyst/Data_Organize/Project_Organize/cleaned_titles.csv")
clean_credits = pd.read_csv("/Users/hardy/Desktop/Python_file/Python_DataAnalyst/Data_Organize/Project_Organize/cleaned_credits.csv")
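
One thing to keep in mind: a CSV file stores plain text, so the earlier type conversions (`person_id` to string, `role` to category, `release_year` to datetime) are not preserved across the save and reload. A minimal sketch, using an in-memory stand-in for `cleaned_credits.csv` with made-up contents, shows how the types can be declared again at read time through the `dtype` argument of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for cleaned_credits.csv
csv_text = "person_id,id,role\n3748,tm84618,ACTOR\n14658,tm84618,ACTOR\n"

# Default read: person_id is inferred as an integer again
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.dtypes)

# Declaring dtypes at read time restores the intended types
typed_read = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"person_id": "str", "role": "category"},
)
print(typed_read.dtypes)
```

For the titles table, the `parse_dates` argument of `pd.read_csv` plays the same part for a date column such as `release_year`.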

In [31]:
clean_titles

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972-01-01,R,109,drama,US,,tt0068473,7.7,107673.0,10.010,7.300
3,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972-01-01,R,109,action,US,,tt0068473,7.7,107673.0,10.010,7.300
4,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972-01-01,R,109,thriller,US,,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16965,tm898842,C/O Kaadhal,MOVIE,A heart warming film that explores the concept...,2021-01-01,,134,drama,,,tt11803618,7.7,348.0,,
16966,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
16967,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021-01-01,,7,family,,1.0,tt13711094,7.8,18.0,2.289,10.000
16968,ts271048,Mighty Little Bheem: Kite Festival,SHOW,"With winter behind them, Bheem and his townspe...",2021-01-01,,7,animation,,1.0,tt13711094,7.8,18.0,2.289,10.000


In [32]:
clean_credits

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


### Merging the data

In [33]:
# Merge the two DataFrames on id (the title ID)
titles_with_credits = pd.merge(clean_titles, clean_credits, on = "id", how = "inner")

In [34]:
titles_with_credits

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id,name,character,role
0,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,3748,Robert De Niro,Travis Bickle,ACTOR
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,14658,Jodie Foster,Iris Steensma,ACTOR
2,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,7064,Albert Brooks,Tom,ACTOR
3,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,3739,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179,48933,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
276104,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,736339,Adelaida Buscato,María Paz,ACTOR
276105,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,399499,Luz Stella Luengas,Karen Bayona,ACTOR
276106,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,373198,Inés Prieto,Fanny,ACTOR
276107,tm1059008,Lokillo,MOVIE,A controversial TV host and comedian who has b...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300,378132,Isabel Gaona,Cacica,ACTOR
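
The jump in row count after the merge follows from how an inner join pairs rows. The sketch below uses two made-up tables: every row of a title on the left is paired with every cast row of that title on the right, and IDs present in only one table are dropped.

```python
import pandas as pd

# Hypothetical toy tables; tm1 has two genre rows and two cast rows
titles = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2"],
    "genres": ["drama", "crime", "comedy"],
    "imdb_score": [8.2, 8.2, 6.0],
})
credits = pd.DataFrame({
    "id": ["tm1", "tm1", "tm3"],
    "person_id": [10, 11, 12],
})

merged = pd.merge(titles, credits, on="id", how="inner")
print(merged)
# tm1: 2 genre rows x 2 cast rows = 4 rows; tm2 and tm3 have no match and are dropped
```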


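### Filtering the required data

Since we are only interested in how well received each actor's titles are, directors are outside the scope of the analysis. We therefore filter on `role` and keep only the observations of type `ACTOR` for later analysis.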

In [42]:
# Keep only the rows where role is ACTOR, then check that no other rows remain
titles_with_credits = titles_with_credits[titles_with_credits["role"] == "ACTOR"]
titles_with_credits.query('role != "ACTOR"')

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score,person_id,name,character,role


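### Grouping and aggregating

To find the actors in highly rated titles within each genre, we group by `genres` and `person_id` and then take the mean of `imdb_score` (that is, the actor's average score within that genre).

Note: actors are grouped by `person_id` rather than `name`, because names are prone to misspellings and duplicates; the person ID identifies an actor more reliably than the name does.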

In [44]:
# Group titles_with_credits and compute the mean
groupby_genres_and_person_id = titles_with_credits.groupby(["genres", "person_id"])["imdb_score"].mean()
groupby_genres_and_person_id

genres   person_id
action   45           5.0
         48           5.4
         51           6.4
         53           6.8
         54           5.3
                     ... 
western  2353339      6.9
         2370848      6.1
         2398539      3.8
         2406218      6.0
         2408082      7.3
Name: imdb_score, Length: 168881, dtype: float64
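
One caveat about this mean: because `production_countries` was exploded earlier, a title with several production countries occupies several rows per genre, so it is counted more than once in an actor's average. The sketch below, on made-up rows, shows the effect and one possible remedy, assuming that each title should count once per genre and actor. The `drop_duplicates` step is an addition for illustration and is not part of the notebook above.

```python
import pandas as pd

# Hypothetical merged data: title tm1 has two production countries
rows = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2"],
    "genres": ["action", "action", "action"],
    "production_countries": ["GB", "US", "US"],
    "person_id": [7, 7, 7],
    "imdb_score": [8.0, 8.0, 5.0],
})

# Naive mean counts tm1 twice: (8.0 + 8.0 + 5.0) / 3 = 7.0
naive = rows.groupby(["genres", "person_id"])["imdb_score"].mean()

# One row per title, genre and actor: (8.0 + 5.0) / 2 = 6.5
unique_rows = rows.drop_duplicates(subset=["id", "genres", "person_id"])
per_title = unique_rows.groupby(["genres", "person_id"])["imdb_score"].mean()

print(naive.loc[("action", 7)], per_title.loc[("action", 7)])  # 7.0 6.5
```

Whether the weighted or the per-title mean is the right one depends on the question being asked; the sketch only makes the difference visible.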

### Resetting the index

In [46]:
# Reset the index of the result above
groupby_genres_and_person_id_reset = groupby_genres_and_person_id.reset_index()
groupby_genres_and_person_id_reset

Unnamed: 0,genres,person_id,imdb_score
0,action,45,5.0
1,action,48,5.4
2,action,51,6.4
3,action,53,6.8
4,action,54,5.3
...,...,...,...
168876,western,2353339,6.9
168877,western,2370848,6.1
168878,western,2398539,3.8
168879,western,2406218,6.0


In [51]:
# Find the highest average score in each genre
groupby_genres_and_person_id_max = groupby_genres_and_person_id_reset.groupby("genres")["imdb_score"].max()
groupby_genres_and_person_id_max

genres
action           9.3
animation        9.3
comedy           9.2
crime            9.5
documentation    9.1
drama            9.5
european         8.9
family           9.3
fantasy          9.3
history          9.1
horror           9.0
music            8.8
reality          8.9
romance          9.2
scifi            9.3
sport            9.1
thriller         9.5
war              8.8
western          8.9
Name: imdb_score, dtype: float64

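### Merging again to find the required information

Next, merge `groupby_genres_and_person_id_max` with `groupby_genres_and_person_id_reset` on `genres` and `imdb_score` to obtain the `person_id` of the actors who hold the top score in each genre.

In [64]:
# Merge groupby_genres_and_person_id_max and groupby_genres_and_person_id_reset on genres and imdb_score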
genres_max_scores = pd.merge(groupby_genres_and_person_id_max, groupby_genres_and_person_id_reset, on=["genres", "imdb_score"], how="inner")
genres_max_scores

Unnamed: 0,genres,imdb_score,person_id
0,action,9.3,1303
1,action,9.3,12790
2,action,9.3,21033
3,action,9.3,86591
4,action,9.3,336830
...,...,...,...
131,war,8.8,826547
132,western,8.9,22311
133,western,8.9,28166
134,western,8.9,28180
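
The maximum-then-merge steps above can also be expressed without a second merge. A minimal sketch on made-up data uses `groupby(...).transform("max")`, which returns the group maximum aligned to every original row, and then keeps the rows that equal it. This is an alternative for comparison, not the approach taken in the notebook.

```python
import pandas as pd

# Hypothetical per-actor averages; two actors tie for the top score in "war"
scores = pd.DataFrame({
    "genres": ["action", "action", "war", "war"],
    "person_id": [1, 2, 3, 4],
    "imdb_score": [9.3, 7.0, 8.8, 8.8],
})

# transform("max") broadcasts each genre's maximum back onto every row
genre_max = scores.groupby("genres")["imdb_score"].transform("max")
top_actors = scores[scores["imdb_score"] == genre_max]
print(top_actors)
```

Like the merge, this keeps every actor tied for the top score within a genre.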


With the table of top average scores and actor IDs in hand, we need to link `person_id` to `name`. To do so, extract `person_id` and `name` from `clean_credits` and drop the duplicate rows.

In [65]:
person_id_with_names = clean_credits[["person_id", "name"]].drop_duplicates()
person_id_with_names

Unnamed: 0,person_id,name
0,3748,Robert De Niro
1,14658,Jodie Foster
2,7064,Albert Brooks
3,3739,Harvey Keitel
4,48933,Cybill Shepherd
...,...,...
77796,736339,Adelaida Buscato
77797,399499,Luz Stella Luengas
77798,373198,Inés Prieto
77799,378132,Isabel Gaona


Merging `person_id_with_names` with `genres_max_scores` then gives the actor ID and name behind each genre's highest average score.

In [81]:
# Merge person_id_with_names and genres_max_scores on person_id
genres_max_scores_with_name = pd.merge(genres_max_scores, person_id_with_names, on="person_id", how="inner")
genres_max_scores_with_name

Unnamed: 0,genres,imdb_score,person_id,name
0,action,9.3,1303,Jessie Flower
1,animation,9.3,1303,Jessie Flower
2,family,9.3,1303,Jessie Flower
3,fantasy,9.3,1303,Jessie Flower
4,scifi,9.3,1303,Jessie Flower
...,...,...,...,...
131,war,8.8,826547,Yuto Uemura
132,western,8.9,22311,Koichi Yamadera
133,western,8.9,28166,Megumi Hayashibara
134,western,8.9,28180,Unsho Ishizuka


To keep rows of the same genre together, we can sort the result by `genres` with `sort_values`, then renumber the index with `reset_index`.

After the index is reset, the DataFrame gains an extra `index` column, which we then drop.

In [82]:
genres_max_scores_with_name = genres_max_scores_with_name.sort_values("genres").reset_index().drop("index", axis=1)
genres_max_scores_with_name

Unnamed: 0,genres,imdb_score,person_id,name
0,action,9.3,1303,Jessie Flower
1,action,9.3,86591,Cricket Leigh
2,action,9.3,21033,Zach Tyler
3,action,9.3,12790,Olivia Hack
4,action,9.3,336830,André Sogliuzzo
...,...,...,...,...
131,war,8.8,826547,Yuto Uemura
132,western,8.9,28166,Megumi Hayashibara
133,western,8.9,28180,Unsho Ishizuka
134,western,8.9,22311,Koichi Yamadera
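
Two small variations on this last step may be useful. Passing `drop=True` to `reset_index` discards the old index directly, so no `index` column is created and no separate `drop` is needed. Passing `kind="stable"` to `sort_values` keeps the existing row order within each genre, whereas the default sort does not guarantee it. The sketch below uses made-up data.

```python
import pandas as pd

# Hypothetical toy result with genres out of order
result = pd.DataFrame({
    "genres": ["war", "action", "action"],
    "name": ["C", "A", "B"],
})

# kind="stable" keeps the existing order within each genre;
# drop=True discards the old index instead of adding an "index" column
tidy = result.sort_values("genres", kind="stable").reset_index(drop=True)
print(tidy)
```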
