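# Project: Wrangling Netflix Cast Rating Data

## Analysis Goal

The goal of this analysis is to compute, for each genre of film and television (comedy, action, sci-fi, and so on), the average IMDB score of the titles each actor appeared in, and from that identify the actors behind the highly rated titles in each genre.

This hands-on project is an exercise in data wrangling: the result is a dataset that is ready for the next stage of analysis.

## Introduction

The raw dataset records all Netflix TV shows and movies available in the United States as of July 2022. It consists of two tables: `titles.csv` and `credits.csv`.

`titles.csv` holds information about the movies and TV shows, including title ID, title, type, description, genres, IMDB score (IMDB is an online rating site), and so on. `credits.csv` holds more than 70,000 credit records for directors and actors appearing in Netflix titles, including name, title ID, character name, and role type (director/actor).

The columns of `titles.csv` are:
- id: title ID.
- title: name of the title.
- type: kind of title, TV show or movie.
- description: short description.
- release_year: release year.
- age_certification: age certification.
- runtime: length of the movie, or of each episode for a TV show.
- genres: list of genres.
- production_countries: list of production countries.
- seasons: number of seasons, if the title is a TV show.
- imdb_id: IMDB ID.
- imdb_score: IMDB score.
- imdb_votes: number of IMDB votes.
- tmdb_popularity: TMDB popularity.
- tmdb_score: TMDB score.

The columns of `credits.csv` are:
- person_id: cast or crew member ID.
- id: ID of the title the person worked on.
- name: person's name.
- character: character name.
- role: role type, actor or director.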

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns",50)
pd.set_option("display.max_colwidth",20)

In [2]:
df1 = pd.read_csv("titles.csv")
df1.head(20)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: ...,SHOW,This collection ...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,"['drama', 'actio...",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and...,MOVIE,"King Arthur, acc...",1975,PG,91,"['fantasy', 'act...",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American mili...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6
5,ts22164,Monty Python's F...,SHOW,A British sketch...,1969,TV-14,30,"['comedy', 'euro...",['GB'],4.0,tt0063929,8.8,73424.0,17.617,8.306
6,tm70993,Life of Brian,MOVIE,Brian Cohen is a...,1979,R,94,['comedy'],['GB'],,tt0079470,8.0,395024.0,17.77,7.8
7,tm14873,Dirty Harry,MOVIE,When a madman du...,1971,R,102,"['thriller', 'ac...",['US'],,tt0066999,7.7,155051.0,12.817,7.5
8,tm119281,Bonnie and Clyde,MOVIE,"In the 1930s, bo...",1967,R,110,"['crime', 'drama...",['US'],,tt0061418,7.7,112048.0,15.687,7.5
9,tm98978,The Blue Lagoon,MOVIE,Two small childr...,1980,R,104,"['romance', 'act...",['US'],,tt0080453,5.8,69844.0,50.324,6.156


In [3]:
df2 = pd.read_csv("credits.csv")
df2.sample(20)

Unnamed: 0,person_id,id,name,character,role
10081,296144,tm44378,Sergio Monge,Tony 1,ACTOR
77210,2344967,tm863829,Amalia Suryani,Indonesian Friend 2,ACTOR
45135,654771,tm817504,Earnestine Phillips,Esther,ACTOR
38176,279919,tm358666,Idit Teperson,"Malka, Rabbi's wife",ACTOR
68105,852365,tm413169,Will Reichelt,JW Rooster III (...,ACTOR
59504,1289880,tm873159,Ju Xiaowen,The Eye Demon,ACTOR
27841,60457,tm232797,Lee Armstrong,Grenadier,ACTOR
62125,1154847,tm810357,Harold Harris,,DIRECTOR
63405,5271,tm942492,Vincent Tong,Drac Shadows & N...,ACTOR
34103,374905,tm420334,Amber Hodgkiss,Ginny,ACTOR


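**Assessing the data**

#### Structural issues

#### Some columns in the titles table, such as `genres`, hold a list of several values in one cell and should be split into separate rows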

In [4]:
df1["genres"][1]  # Looks like a list but is actually a string; eval runs the string's contents as Python code and returns the list

"['drama', 'crime']"

In [5]:
df1["genres"] = df1["genres"].apply(lambda s : eval(s))  # Effectively strips the outer quotes, so each cell becomes a real list (only use eval on data you trust)
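
Because `eval` executes whatever code the string contains, a more defensive variant of the same step is sketched below with the standard library's `ast.literal_eval`, which only accepts Python literals. The `toy` frame is a made-up stand-in for `df1`, not the real data.

```python
import ast
import pandas as pd

# Hypothetical stand-in for the `genres` column: each cell is the *string* form of a list.
toy = pd.DataFrame({"genres": ["['drama', 'crime']", "['comedy']", "[]"]})

# literal_eval parses literals (lists, strings, numbers, ...) and refuses anything else,
# so it cannot run arbitrary code the way eval can.
toy["genres"] = toy["genres"].apply(ast.literal_eval)

print(toy["genres"][0])
print(type(toy["genres"][0]))
```

For a trusted CSV like this one both approaches give the same lists; the difference only matters when the file could contain something other than list literals.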

In [6]:
df1

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: ...,SHOW,This collection ...,1945,TV-MA,51,[documentation],['US'],1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976,R,114,"[drama, crime]",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,"[drama, action, ...",['US'],,tt0068473,7.7,107673.0,10.010,7.300
3,tm127384,Monty Python and...,MOVIE,"King Arthur, acc...",1975,PG,91,"[fantasy, action...",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American mili...,1967,,150,"[war, action]","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.600
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5845,tm1014599,Fine Wine,MOVIE,A beautiful love...,2021,,100,"[romance, drama]",['NG'],,tt13857480,6.8,45.0,1.466,
5846,tm898842,C/O Kaadhal,MOVIE,A heart warming ...,2021,,134,[drama],[],,tt11803618,7.7,348.0,,
5847,tm1059008,Lokillo,MOVIE,A controversial ...,2021,,90,[comedy],['CO'],,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarra...,MOVIE,"Jamie Foxx, Davi...",2021,PG-13,37,[],['US'],,,,,1.296,10.000


In [7]:
#### Use the explode method to split each list into separate rows

In [8]:
df1 = df1.explode("genres")
df1.head(10)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: ...,SHOW,This collection ...,1945,TV-MA,51,documentation,['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976,R,114,drama,['US'],,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976,R,114,crime,['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,drama,['US'],,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,action,['US'],,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,thriller,['US'],,tt0068473,7.7,107673.0,10.01,7.3
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,european,['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and...,MOVIE,"King Arthur, acc...",1975,PG,91,fantasy,['GB'],,tt0071853,8.2,534486.0,15.461,7.811
3,tm127384,Monty Python and...,MOVIE,"King Arthur, acc...",1975,PG,91,action,['GB'],,tt0071853,8.2,534486.0,15.461,7.811
3,tm127384,Monty Python and...,MOVIE,"King Arthur, acc...",1975,PG,91,comedy,['GB'],,tt0071853,8.2,534486.0,15.461,7.811
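
To make the behaviour of `explode` easier to see, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration). It shows two things visible in the real output: the original index label is repeated for every exploded row, and an empty list becomes a missing value.

```python
import pandas as pd

# Hypothetical data: one title with two genres, one with one genre, one with none.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "genres": [["drama", "crime"], ["comedy"], []],
})

exploded = toy.explode("genres")

# Row "a" appears twice under index label 0; row "c" keeps one row with NaN as its genre.
print(exploded)
```

This is why titles with an empty `genres` list show up later as rows with a missing genre.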


In [9]:
# production_countries needs to be split as well
df1["production_countries"][0]

"['US']"

In [10]:
df1["production_countries"] = df1["production_countries"].apply(lambda s : eval(s))

In [11]:
df1["production_countries"][0]

['US']

In [12]:
df1 = df1.explode("production_countries")
df1

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: ...,SHOW,This collection ...,1945,TV-MA,51,documentation,US,1.0,,,,0.600,
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,drama,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972,R,109,action,US,,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5847,tm1059008,Lokillo,MOVIE,A controversial ...,2021,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
5848,tm1035612,Dad Stop Embarra...,MOVIE,"Jamie Foxx, Davi...",2021,PG-13,37,,US,,,,,1.296,10.000
5849,ts271048,Mighty Little Bh...,SHOW,With winter behi...,2021,,7,family,,1.0,tt13711094,7.8,18.0,2.289,10.000
5849,ts271048,Mighty Little Bh...,SHOW,With winter behi...,2021,,7,animation,,1.0,tt13711094,7.8,18.0,2.289,10.000


In [13]:
#  The structural issues in the titles table are handled. Now for credits: it already has one observation per row, one variable per column, and one value per cell

In [14]:
df2

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' ...,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


### Data cleanliness

In [15]:
# Check the cleanliness of the titles table

In [16]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17818 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    17818 non-null  object 
 1   title                 17817 non-null  object 
 2   type                  17818 non-null  object 
 3   description           17790 non-null  object 
 4   release_year          17818 non-null  int64  
 5   age_certification     10889 non-null  object 
 6   runtime               17818 non-null  int64  
 7   genres                17755 non-null  object 
 8   production_countries  17439 non-null  object 
 9   seasons               6224 non-null   float64
 10  imdb_id               17116 non-null  object 
 11  imdb_score            16976 non-null  float64
 12  imdb_votes            16945 non-null  float64
 13  tmdb_popularity       17663 non-null  float64
 14  tmdb_score            17241 non-null  float64
dtypes: float64(5), int64(2), 

From the output, `df1` has 17818 observations. The variables `title`, `description`, `age_certification`, `genres`, `production_countries`, `seasons`, `imdb_id`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` all contain missing values, which are assessed and cleaned below.

In addition, `release_year` represents a year, so it should be stored as a date rather than as an integer and needs a type conversion.

In [17]:
df1["release_year"] = pd.to_datetime(df1["release_year"],format = '%Y')  
# format='%Y' tells pandas that the values in release_year contain only a four-digit year
df1["release_year"]

0      1945-01-01
1      1976-01-01
1      1976-01-01
2      1972-01-01
2      1972-01-01
          ...    
5847   2021-01-01
5848   2021-01-01
5849   2021-01-01
5849   2021-01-01
5849   2021-01-01
Name: release_year, Length: 17818, dtype: datetime64[ns]
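
As a small illustration of this conversion, the sketch below parses a few integer years (made-up sample values) with `format="%Y"` and then reads the year back through the `.dt` accessor. Each year is mapped to January 1 of that year, which matches the output above.

```python
import pandas as pd

# Hypothetical sample of integer years, like the original release_year column.
years = pd.Series([1945, 1976, 2021], name="release_year")

# Each year becomes a timestamp at the start of that year.
as_dates = pd.to_datetime(years, format="%Y")

# The year can still be recovered later if only the number is needed.
back = as_dates.dt.year

print(as_dates)
print(back.tolist())
```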

In [18]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB


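From the output, `df2` has 77801 observations, and only the `character` variable contains missing values, which are assessed and cleaned below.

In addition, `person_id` is an identifier for a cast or crew member, so it should be stored as a string rather than as a number and needs a type conversion.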

In [19]:
df2["person_id"]  = df2["person_id"].astype(str)
df2["person_id"]

0           3748
1          14658
2           7064
3           3739
4          48933
          ...   
77796     736339
77797     399499
77798     373198
77799     378132
77800    1950416
Name: person_id, Length: 77801, dtype: object

**Next, handle missing values**

In `df1`, the variables `title`, `description`, `age_certification`, `genres`, `production_countries`, `seasons`, `imdb_id`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` contain missing values.

A title's name, description, age certification, production countries, number of seasons, IMDB ID, IMDB vote count, TMDB popularity, and TMDB score do not affect the search for actors in highly rated titles within each genre, so observations with missing values in `title`, `description`, `age_certification`, `production_countries`, `seasons`, `imdb_id`, `imdb_votes`, `tmdb_popularity`, or `tmdb_score` can be kept.

`imdb_score` and `genres`, however, are central to the analysis that follows.

First, pull out the observations with a missing `imdb_score` and take a look.

In [20]:
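## The overall goal is to find the highest rated actors in each genre, so actor, score, and genre are the fields that matter most

In [21]:
df1.query("imdb_score.isnull()")  # query also filters rows, but the whole condition is one string, so column names inside it need no quotes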

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: ...,SHOW,This collection ...,1945-01-01,TV-MA,51,documentation,US,1.0,,,,0.600,
75,tm132164,Bill Hicks: Sane...,MOVIE,Sane Man was fil...,1989-01-01,R,80,comedy,US,,,,,3.377,7.5
145,ts251477,My First Errand,SHOW,“Hajimete no Ots...,1991-01-01,TV-G,18,documentation,JP,12.0,,,,7.730,7.8
145,ts251477,My First Errand,SHOW,“Hajimete no Ots...,1991-01-01,TV-G,18,family,JP,12.0,,,,7.730,7.8
145,ts251477,My First Errand,SHOW,“Hajimete no Ots...,1991-01-01,TV-G,18,reality,JP,12.0,,,,7.730,7.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5810,tm1225897,Social Man,MOVIE,Two competitive ...,2021-01-01,,96,drama,,,tt20198164,,,,
5833,ts307884,HQ Barbers,SHOW,When a family ru...,2021-01-01,TV-14,24,comedy,NG,1.0,,,,0.840,
5840,tm1216735,Sun of the Soil,MOVIE,In 14th-century ...,2022-01-01,,26,,,,,,,1.179,7.0
5844,tm1074617,Bling Empire - T...,MOVIE,"The stars of ""Bl...",2021-01-01,,35,,US,,,,,,


In [22]:
# These rows have no score, so drop them

In [23]:
df1.dropna(subset = ["imdb_score"],inplace = True)

In [24]:
df1["imdb_score"].isnull().sum()

np.int64(0)
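
The effect of `dropna(subset=...)` can be sketched on a small made-up frame: only missing values in the listed columns cause a row to be dropped, so a row with a missing `title` survives as long as its score and genre are present. The `toy` data is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: missing title, missing genre, missing score, and a complete row.
toy = pd.DataFrame({
    "title": [None, "B", "C", "D"],
    "genres": ["drama", None, "comedy", "comedy"],
    "imdb_score": [7.1, 6.0, np.nan, 8.0],
})

# Only the columns named in `subset` are checked for missing values.
kept = toy.dropna(subset=["imdb_score", "genres"])

print(kept)
```

The notebook drops on `imdb_score` and `genres` in two separate calls; listing both columns in one call is an equivalent shortcut.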

In [25]:
df1.query("imdb_score.isnull()")  # No shows or movies with a missing IMDB score remain

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score


In [26]:
len(df1.query("genres.isnull()"))   # Six rows have a missing genre

6

In [27]:
df1.query("genres.isnull()")

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
1813,ts77824,My Next Guest Ne...,SHOW,TV legend David ...,2018-01-01,TV-MA,50,,US,4.0,tt7829834,7.8,5581.0,8.217,7.6
1939,ts215037,Minecraft: Story...,SHOW,MInecraft: Story...,2018-01-01,TV-PG,52,,US,1.0,tt10498322,5.6,347.0,,
2386,ts74805,A Little Help wi...,SHOW,In this unscript...,2018-01-01,TV-G,24,,US,1.0,tt7204366,6.3,237.0,1.621,6.2
2658,ts265844,#ABtalks,SHOW,#ABtalks is a Yo...,2018-01-01,TV-PG,68,,,1.0,tt12635254,9.6,7.0,,
4274,tm1172010,The Lockdown Plan,MOVIE,,2020-01-01,,49,,,,tt13079112,6.5,,,
4648,tm1113921,In Vitro,MOVIE,'In Vitro' is an...,2019-01-01,,27,,,,tt10545994,7.7,,,


In [28]:
df1.dropna(subset=["genres"],inplace = True)

In [29]:
df1["genres"].isnull().sum()

np.int64(0)

In [30]:
df1[ df1["genres"].isnull() ]  # Equivalent to df1.query("genres.isnull()")

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score


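**Duplicate data**

In [31]:
## Think first: df1 describes every show and movie. Repeated titles are fine, because exploding the list columns split one title into several rows,
## and the IMDB score does not change across those rows: a title scored 7 under one genre is also scored 7 under another.
## duplicated() on a DataFrame only flags rows that are identical in every column, so it can be used directly.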

In [32]:
df1.duplicated().sum()

np.int64(0)

In [33]:
df2

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' ...,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


In [34]:
df2.duplicated().sum()

np.int64(0)

**Inconsistent data**

In [35]:
df1["genres"].value_counts()  # No two genre labels refer to the same genre

genres
drama            3357
comedy           2419
thriller         1446
action           1339
romance          1080
crime            1066
documentation     981
family            769
animation         732
fantasy           727
european          679
scifi             647
horror            438
history           336
music             266
reality           226
war               221
sport             188
western            53
Name: count, dtype: int64

In [36]:
"" in df1["genres"].values   # Double-check for stragglers: an empty string would not have been counted as missing

False

In [37]:
df1["production_countries"].value_counts()  # Too many countries to display in full; setting `display.max_rows` to `None` removes the row limit

production_countries
US    5648
IN    1610
GB    1068
JP    1046
FR     720
      ... 
CU       1
LK       1
GT       1
AF       1
FO       1
Name: count, Length: 108, dtype: int64

Because the full output is only needed for this one `value_counts` call, we can use `option_context` to lift the limit temporarily.
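
A minimal sketch of what "temporarily" means here: inside the `with` block the option takes the new value, and once the block exits the previous value is restored. The variable names are only for illustration.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# Inside the block the row limit is lifted...
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")

# ...and after the block the earlier setting is back.
after = pd.get_option("display.max_rows")

print(before, inside, after)
```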

In [38]:
with pd.option_context("display.max_rows",None):  # The output is very long, so I let DeepSeek help scan it
    print(df1["production_countries"].value_counts())

production_countries
US         5648
IN         1610
GB         1068
JP         1046
FR          720
ES          637
KR          637
CA          608
DE          383
CN          295
MX          264
IT          224
BR          221
AU          217
TR          195
PH          192
AR          150
ID          149
BE          148
TW          133
NG          131
PL          126
ZA          103
HK          102
NL          102
CO           94
EG           93
DK           89
TH           87
SE           81
LB           70
NO           68
AE           52
IE           49
SG           47
XX           43
IL           42
RU           41
CL           35
CH           33
PS           32
BG           31
MY           30
SA           28
IS           28
AT           28
NZ           27
LU           27
PE           26
RO           25
QA           24
CZ           22
JO           19
FI           18
HU           18
UY           15
MA           15
PT           14
KH           10
KW           10
PR            9
PK 

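From the output above, production countries are written as two-letter country codes, except for one `Lebanon` value.

The country code for Lebanon is `LB`, which already appears 70 times in the output above, so the data is inconsistent: `LB` and `Lebanon` refer to the same country and need to be unified.

In [39]:
df1["production_countries"] = df1["production_countries"].replace({"Lebanon":"LB"})  # Series.replace and .str.replace behave differently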
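
The difference mentioned in the comment can be sketched on a few made-up country values: `Series.replace` swaps whole cell values, while `.str.replace` rewrites substrings inside each string.

```python
import pandas as pd

# Hypothetical sample of country values.
countries = pd.Series(["LB", "Lebanon", "US", "GB"])

# Series.replace matches the entire value, so only the "Lebanon" cell changes.
whole = countries.replace({"Lebanon": "LB"})

# Series.replace with "B" changes nothing, because no cell equals "B" exactly.
unchanged = countries.replace("B", "b")

# .str.replace works inside each string, so every capital "B" is rewritten.
sub = countries.str.replace("B", "b", regex=False)

print(whole.tolist())
print(unchanged.tolist())
print(sub.tolist())
```

For unifying category labels like country codes, whole-value replacement is usually the safer choice, since it cannot accidentally alter part of another code.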

In [40]:
with pd.option_context("display.max_rows",None):
    print( "" in df1["production_countries"].values)
    print( df1["production_countries"].value_counts())

False
production_countries
US    5648
IN    1610
GB    1068
JP    1046
FR     720
ES     637
KR     637
CA     608
DE     383
CN     295
MX     264
IT     224
BR     221
AU     217
TR     195
PH     192
AR     150
ID     149
BE     148
TW     133
NG     131
PL     126
ZA     103
HK     102
NL     102
CO      94
EG      93
DK      89
TH      87
SE      81
LB      71
NO      68
AE      52
IE      49
SG      47
XX      43
IL      42
RU      41
CL      35
CH      33
PS      32
BG      31
MY      30
SA      28
AT      28
IS      28
NZ      27
LU      27
PE      26
RO      25
QA      24
CZ      22
JO      19
FI      18
HU      18
MA      15
UY      15
PT      14
KW      10
KH      10
PK       9
PR       9
UA       8
VN       8
MT       8
SU       7
CD       7
TN       7
LT       7
IR       7
GH       6
SN       6
AL       6
KE       6
IQ       5
MU       5
CY       5
TZ       4
SY       4
MC       4
IO       4
KN       4
GR       4
BD       3
BS       3
DZ       3
GL       3
AO       3
CM   

In [41]:
# Country values may still contain other inconsistencies, but they do not affect the goal of this analysis

In [42]:
df2

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' ...,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


In [43]:
df2["role"]   # dtype is object, but category fits better: there are only two distinct values, and category uses less memory than strings

0           ACTOR
1           ACTOR
2           ACTOR
3           ACTOR
4           ACTOR
           ...   
77796       ACTOR
77797       ACTOR
77798       ACTOR
77799       ACTOR
77800    DIRECTOR
Name: role, Length: 77801, dtype: object

In [44]:
df2["role"].value_counts()

role
ACTOR       73251
DIRECTOR     4550
Name: count, dtype: int64

In [45]:
# role in the credits table takes only two values, actor and director, so it has a finite set of values and can be converted to category

In [46]:
df2["role"] = df2["role"].astype("category")

In [47]:
df2["role"]

0           ACTOR
1           ACTOR
2           ACTOR
3           ACTOR
4           ACTOR
           ...   
77796       ACTOR
77797       ACTOR
77798       ACTOR
77799       ACTOR
77800    DIRECTOR
Name: role, Length: 77801, dtype: category
Categories (2, object): ['ACTOR', 'DIRECTOR']
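
To check the memory claim, the sketch below builds a made-up column of 10,000 role labels and compares its footprint before and after the conversion with `memory_usage(deep=True)`. Exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical role column with the same two labels as the credits table.
roles = pd.Series(["ACTOR"] * 9000 + ["DIRECTOR"] * 1000)
as_cat = roles.astype("category")

# deep=True counts the string payload too, not only the pointers.
obj_bytes = roles.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)

print(obj_bytes, cat_bytes)
print(list(as_cat.cat.categories))
```

A categorical stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of rows.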

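**Handling invalid data**

In [48]:
df1.describe()  # Scores stay within the 1-10 rating scale and nothing is impossible; the minimum runtime of 0 looks questionable, but runtime is not used in this analysis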

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,16970,16970.0,5954.0,16970.0,16941.0,16842.0,16515.0
mean,2015-11-14 22:42...,80.912552,2.455492,6.514207,32816.55,29.396307,6.846933
min,1954-01-01 00:00:00,0.0,1.0,1.5,5.0,0.6,1.0
25%,2015-01-01 00:00:00,45.0,1.0,5.8,780.0,4.07,6.2
50%,2018-01-01 00:00:00,90.0,2.0,6.6,3508.0,10.195,6.9
75%,2020-01-01 00:00:00,107.0,3.0,7.3,16978.0,23.639,7.5
max,2022-01-01 00:00:00,225.0,42.0,9.5,2294231.0,2274.044,10.0
std,,39.596172,2.869428,1.131095,114149.2,93.178235,1.078831


In [49]:
df2  # Every column in df2 holds text labels; the fields this analysis relies on, actor name and title id, contain no invalid values

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' ...,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


**Saving the data**

In [50]:
df1

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972-01-01,R,109,drama,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972-01-01,R,109,action,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972-01-01,R,109,thriller,US,,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5846,tm898842,C/O Kaadhal,MOVIE,A heart warming ...,2021-01-01,,134,drama,,,tt11803618,7.7,348.0,,
5847,tm1059008,Lokillo,MOVIE,A controversial ...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
5849,ts271048,Mighty Little Bh...,SHOW,With winter behi...,2021-01-01,,7,family,,1.0,tt13711094,7.8,18.0,2.289,10.000
5849,ts271048,Mighty Little Bh...,SHOW,With winter behi...,2021-01-01,,7,animation,,1.0,tt13711094,7.8,18.0,2.289,10.000


In [51]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16970 entries, 1 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    16970 non-null  object        
 1   title                 16970 non-null  object        
 2   type                  16970 non-null  object        
 3   description           16965 non-null  object        
 4   release_year          16970 non-null  datetime64[ns]
 5   age_certification     10506 non-null  object        
 6   runtime               16970 non-null  int64         
 7   genres                16970 non-null  object        
 8   production_countries  16670 non-null  object        
 9   seasons               5954 non-null   float64       
 10  imdb_id               16970 non-null  object        
 11  imdb_score            16970 non-null  float64       
 12  imdb_votes            16941 non-null  float64       
 13  tmdb_popularity       

In [52]:
df1.describe()

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,16970,16970.0,5954.0,16970.0,16941.0,16842.0,16515.0
mean,2015-11-14 22:42...,80.912552,2.455492,6.514207,32816.55,29.396307,6.846933
min,1954-01-01 00:00:00,0.0,1.0,1.5,5.0,0.6,1.0
25%,2015-01-01 00:00:00,45.0,1.0,5.8,780.0,4.07,6.2
50%,2018-01-01 00:00:00,90.0,2.0,6.6,3508.0,10.195,6.9
75%,2020-01-01 00:00:00,107.0,3.0,7.3,16978.0,23.639,7.5
max,2022-01-01 00:00:00,225.0,42.0,9.5,2294231.0,2274.044,10.0
std,,39.596172,2.869428,1.131095,114149.2,93.178235,1.078831


In [53]:
cleaned_title = df1.copy()

In [54]:
df2

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' ...,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


In [55]:
df2.describe()

Unnamed: 0,person_id,id,name,character,role
count,77801,77801,77801,68029,77801
unique,54589,5489,54314,47274,2
top,38636,tm32982,Kareena Kapoor Khan,Self,ACTOR
freq,25,208,25,1950,73251


In [56]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   person_id  77801 non-null  object  
 1   id         77801 non-null  object  
 2   name       77801 non-null  object  
 3   character  68029 non-null  object  
 4   role       77801 non-null  category
dtypes: category(1), object(4)
memory usage: 2.4+ MB


In [57]:
cleaned_credit = df2.copy()

In [58]:
#  cleaned_title and cleaned_credit

# Tidying the data

In [59]:
cleaned_title 

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,tm84618,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972-01-01,R,109,drama,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972-01-01,R,109,action,US,,tt0068473,7.7,107673.0,10.010,7.300
2,tm154986,Deliverance,MOVIE,Intent on seeing...,1972-01-01,R,109,thriller,US,,tt0068473,7.7,107673.0,10.010,7.300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5846,tm898842,C/O Kaadhal,MOVIE,A heart warming ...,2021-01-01,,134,drama,,,tt11803618,7.7,348.0,,
5847,tm1059008,Lokillo,MOVIE,A controversial ...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
5849,ts271048,Mighty Little Bh...,SHOW,With winter behi...,2021-01-01,,7,family,,1.0,tt13711094,7.8,18.0,2.289,10.000
5849,ts271048,Mighty Little Bh...,SHOW,With winter behi...,2021-01-01,,7,animation,,1.0,tt13711094,7.8,18.0,2.289,10.000


In [60]:
cleaned_credit

Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' ...,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR
...,...,...,...,...,...
77796,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR
77797,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR
77798,373198,tm1059008,Inés Prieto,Fanny,ACTOR
77799,378132,tm1059008,Isabel Gaona,Cacica,ACTOR


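**At this point, remember that the goal is to find the highly rated actors within each genre, and that we currently have two tables.**

**They should be merged by matching on the title `id`, keeping only ids present in both tables. A title that is in the titles table but not in the credits table has no actors,**

**and a title that is in the credits table but not in the titles table has no genres; either case gets in the way of the analysis. So the merge should use the default, `how="inner"`.**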

In [61]:
credit_title = pd.merge(cleaned_credit,cleaned_title,on = "id",how ="inner" )
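
The reasoning behind `how="inner"` can be sketched on two tiny made-up tables. The inner merge keeps only ids found on both sides, and an outer merge with `indicator=True` is one way to see which rows would otherwise be left without a match. All ids and values below are hypothetical.

```python
import pandas as pd

# Hypothetical credits: two people in title tm1, one in tm9 (absent from titles).
credits = pd.DataFrame({"person_id": ["1", "2", "3"], "id": ["tm1", "tm1", "tm9"]})

# Hypothetical exploded titles: tm1 has two genre rows, tm2 has no credits.
titles = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2"],
    "genres": ["drama", "crime", "comedy"],
    "imdb_score": [8.2, 8.2, 6.0],
})

# Inner merge: each person in tm1 is paired with each genre row of tm1.
inner = pd.merge(credits, titles, on="id", how="inner")

# Outer merge with an indicator column, used here only to inspect the unmatched rows.
outer = pd.merge(credits, titles, on="id", how="outer", indicator=True)

print(inner)
print(outer["_merge"].value_counts())
```

Note how the row count multiplies: two people times two genre rows gives four rows for `tm1`, which is the same effect that makes the real merged table much longer than either input.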

In [62]:
credit_title

Unnamed: 0,person_id,id,name,character,role,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
1,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
2,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
3,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,crime,US,,tt0075314,8.2,808582.0,40.965,8.179
4,7064,tm84618,Albert Brooks,Tom,ACTOR,Taxi Driver,MOVIE,A mentally unsta...,1976-01-01,R,114,drama,US,,tt0075314,8.2,808582.0,40.965,8.179
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
276104,736339,tm1059008,Adelaida Buscato,María Paz,ACTOR,Lokillo,MOVIE,A controversial ...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
276105,399499,tm1059008,Luz Stella Luengas,Karen Bayona,ACTOR,Lokillo,MOVIE,A controversial ...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
276106,373198,tm1059008,Inés Prieto,Fanny,ACTOR,Lokillo,MOVIE,A controversial ...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300
276107,378132,tm1059008,Isabel Gaona,Cacica,ACTOR,Lokillo,MOVIE,A controversial ...,2021-01-01,,90,comedy,CO,,tt14585902,3.8,68.0,26.005,6.300


In [63]:
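credit_title = credit_title.query("role =='ACTOR'") # Arguably this filter belongs earlier, in the df2 cleaning steps

In [64]:
# One actor (one person_id) may appear in many titles of the same genre, each with its own score, so we average those scores and then compare the averages

In [65]:
credit_title.query("person_id == '1000'")  # Actor 1000 appears in several drama titles, so what we want is his average score across those drama titles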

Unnamed: 0,person_id,id,name,character,role,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
12341,1000,tm192199,Martin Sheen,Jason Wynn,ACTOR,Spawn,MOVIE,After being murd...,1997-01-01,PG-13,96,scifi,US,,tt0120177,5.2,68184.0,16.989,5.343
12342,1000,tm192199,Martin Sheen,Jason Wynn,ACTOR,Spawn,MOVIE,After being murd...,1997-01-01,PG-13,96,action,US,,tt0120177,5.2,68184.0,16.989,5.343
12343,1000,tm192199,Martin Sheen,Jason Wynn,ACTOR,Spawn,MOVIE,After being murd...,1997-01-01,PG-13,96,drama,US,,tt0120177,5.2,68184.0,16.989,5.343
12344,1000,tm192199,Martin Sheen,Jason Wynn,ACTOR,Spawn,MOVIE,After being murd...,1997-01-01,PG-13,96,horror,US,,tt0120177,5.2,68184.0,16.989,5.343
16240,1000,tm111828,Martin Sheen,Roger Strong,ACTOR,Catch Me If You Can,MOVIE,A true story abo...,2002-01-01,PG-13,141,drama,US,,tt0264464,8.1,952602.0,72.321,8.0
16241,1000,tm111828,Martin Sheen,Roger Strong,ACTOR,Catch Me If You Can,MOVIE,A true story abo...,2002-01-01,PG-13,141,crime,US,,tt0264464,8.1,952602.0,72.321,8.0
17239,1000,tm27911,Martin Sheen,Capt. Oliver Cha...,ACTOR,The Departed,MOVIE,To take down Sou...,2006-01-01,R,151,drama,US,,tt0407887,8.5,1296244.0,33.795,8.2
17240,1000,tm27911,Martin Sheen,Capt. Oliver Cha...,ACTOR,The Departed,MOVIE,To take down Sou...,2006-01-01,R,151,thriller,US,,tt0407887,8.5,1296244.0,33.795,8.2
17241,1000,tm27911,Martin Sheen,Capt. Oliver Cha...,ACTOR,The Departed,MOVIE,To take down Sou...,2006-01-01,R,151,crime,US,,tt0407887,8.5,1296244.0,33.795,8.2
17242,1000,tm27911,Martin Sheen,Capt. Oliver Cha...,ACTOR,The Departed,MOVIE,To take down Sou...,2006-01-01,R,151,action,US,,tt0407887,8.5,1296244.0,33.795,8.2


In [66]:
pd.pivot_table(credit_title,index =["genres","person_id"],values = "imdb_score",aggfunc = "mean")   # No column in this table suits the `columns` argument, so it is left unset


Unnamed: 0_level_0,Unnamed: 1_level_0,imdb_score
genres,person_id,Unnamed: 2_level_1
action,1000,6.866667
action,100007,7.000000
action,100013,6.400000
action,100019,6.500000
action,100020,6.500000
...,...,...
western,993735,6.500000
western,998673,7.300000
western,998674,7.300000
western,998675,7.300000


In [67]:
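# So the grouping logic is: split by genre (action, comedy, thriller, and so on), then by actor within each genre, and compute the mean IMDB score for each genre and actor pair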
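
The same grouping logic can also be written with `groupby`, which may read more directly than `pivot_table` when no `columns` argument is needed. The sketch below runs both on a made-up frame; the ids and scores are assumptions for illustration.

```python
import pandas as pd

# Hypothetical merged rows: actor 1000 has two drama titles and one crime title.
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "crime"],
    "person_id": ["1000", "1000", "2000", "1000"],
    "imdb_score": [8.0, 6.0, 7.0, 9.0],
})

# Mean score per (genre, actor) pair via pivot_table...
by_pivot = pd.pivot_table(toy, index=["genres", "person_id"],
                          values="imdb_score", aggfunc="mean")

# ...and the same aggregation via groupby.
by_groupby = toy.groupby(["genres", "person_id"])[["imdb_score"]].mean()

print(by_pivot)
print(by_groupby)
```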

In [68]:
imdbscore_groupby = pd.pivot_table(credit_title,index =["genres","person_id"],values = "imdb_score",aggfunc = "mean")  


In [69]:
imdbscore_groupby

Unnamed: 0_level_0,Unnamed: 1_level_0,imdb_score
genres,person_id,Unnamed: 2_level_1
action,1000,6.866667
action,100007,7.000000
action,100013,6.400000
action,100019,6.500000
action,100020,6.500000
...,...,...
western,993735,6.500000
western,998673,7.300000
western,998674,7.300000
western,998675,7.300000


In [70]:
reset_imdbscore_groupby = imdbscore_groupby.reset_index()

In [71]:
reset_imdbscore_groupby 

Unnamed: 0,genres,person_id,imdb_score
0,action,1000,6.866667
1,action,100007,7.000000
2,action,100013,6.400000
3,action,100019,6.500000
4,action,100020,6.500000
...,...,...,...
168876,western,993735,6.500000
168877,western,998673,7.300000
168878,western,998674,7.300000
168879,western,998675,7.300000


In [72]:
genres_max_scores = reset_imdbscore_groupby.groupby("genres")["imdb_score"].max()

In [73]:
genres_max_scores

genres
action           9.3
animation        9.3
comedy           9.2
crime            9.5
documentation    9.1
drama            9.5
european         8.9
family           9.3
fantasy          9.3
history          9.1
horror           9.0
music            8.8
reality          8.9
romance          9.2
scifi            9.3
sport            9.1
thriller         9.5
war              8.8
western          8.9
Name: imdb_score, dtype: float64

In [74]:
genres_max_score_with_person_id = pd.merge(reset_imdbscore_groupby,genres_max_scores,on = ["genres","imdb_score"])
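
An alternative to merging the per-genre maxima back in is to broadcast each genre's maximum onto its rows with `transform` and filter on equality. This is a sketch on made-up averages, not the notebook's method; ties are kept, just as with the merge.

```python
import pandas as pd

# Hypothetical per-actor averages within two genres.
avg = pd.DataFrame({
    "genres": ["action", "action", "action", "war", "war"],
    "person_id": ["1", "2", "3", "4", "5"],
    "imdb_score": [9.3, 9.3, 7.0, 8.8, 6.5],
})

# transform("max") returns one value per row: the maximum of that row's genre.
genre_max = avg.groupby("genres")["imdb_score"].transform("max")

# Keep every row whose score equals its genre maximum (ties included).
top = avg[avg["imdb_score"] == genre_max]

print(top)
```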

In [75]:
genres_max_score_with_person_id

Unnamed: 0,genres,person_id,imdb_score
0,action,12790,9.3
1,action,1303,9.3
2,action,21033,9.3
3,action,336830,9.3
4,action,86591,9.3
...,...,...,...
131,war,826547,8.8
132,western,22311,8.9
133,western,28166,8.9
134,western,28180,8.9


In [76]:
actor_id_with_names = cleaned_credit[["person_id","name"]].drop_duplicates()  # Select the two columns, then drop duplicate rows

In [77]:
actor_id_with_names.head(10)

Unnamed: 0,person_id,name
0,3748,Robert De Niro
1,14658,Jodie Foster
2,7064,Albert Brooks
3,3739,Harvey Keitel
4,48933,Cybill Shepherd
5,32267,Peter Boyle
6,519612,Leonard Harris
7,29068,Diahnne Abbott
8,519613,Gino Ardito
9,3308,Martin Scorsese


In [78]:
genres_max_score_with_actor_name = pd.merge(genres_max_score_with_person_id,actor_id_with_names,on = "person_id")

In [79]:
genres_max_score_with_actor_name

Unnamed: 0,genres,person_id,imdb_score,name
0,action,12790,9.3,Olivia Hack
1,action,1303,9.3,Jessie Flower
2,action,21033,9.3,Zach Tyler
3,action,336830,9.3,André Sogliuzzo
4,action,86591,9.3,Cricket Leigh
...,...,...,...,...
131,war,826547,8.8,Yuto Uemura
132,western,22311,8.9,Koichi Yamadera
133,western,28166,8.9,Megumi Hayashibara
134,western,28180,8.9,Unsho Ishizuka


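To keep rows of the same genre together, we can sort the result by `genres` with `sort_values`, then renumber the index with `reset_index`.

After `reset_index`, the old index is kept in the DataFrame as an extra `index` column, which we then drop.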
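
The reset-then-drop pattern can be shortened with `reset_index(drop=True)`, which discards the old index instead of keeping it as a column. The sketch below compares the two on a made-up frame; `kind="stable"` is added only so that rows with equal genres keep a predictable order in this example.

```python
import pandas as pd

# Hypothetical result rows in arbitrary order.
toy = pd.DataFrame({
    "genres": ["war", "action", "western", "action"],
    "name": ["A", "B", "C", "D"],
})

# Two steps: keep the old index as a column, then drop that column.
two_step = toy.sort_values("genres", kind="stable").reset_index().drop("index", axis=1)

# One step: discard the old index directly.
one_step = toy.sort_values("genres", kind="stable").reset_index(drop=True)

print(one_step)
```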

In [80]:
genres_max_score_with_actor_name = genres_max_score_with_actor_name.sort_values("genres").reset_index().drop("index", axis=1)
genres_max_score_with_actor_name

Unnamed: 0,genres,person_id,imdb_score,name
0,action,12790,9.3,Olivia Hack
1,action,1303,9.3,Jessie Flower
2,action,21033,9.3,Zach Tyler
3,action,336830,9.3,André Sogliuzzo
4,action,86591,9.3,Cricket Leigh
...,...,...,...,...
131,war,826547,8.8,Yuto Uemura
132,western,28180,8.9,Unsho Ishizuka
133,western,22311,8.9,Koichi Yamadera
134,western,28166,8.9,Megumi Hayashibara
