# More JOIN operations

![rel](https://sqlzoo.net/w/images/1/10/Movie-er.png)

In [1]:
import findspark
import pandas as pd
findspark.init()

SVR = '192.168.31.31'
from pyspark.sql import SparkSession

sc = (SparkSession.builder.appName('app07') 
      .master(f'spark://{SVR}:7077') 
      .config('spark.sql.warehouse.dir', f'hdfs://{SVR}:9000/user/hive/warehouse') 
      .config('spark.cores.max', '4') 
      .config('spark.executor.instances', '1') 
      .config('spark.executor.cores', '2') 
      .config('spark.executor.memory', '10g') 
      .enableHiveSupport().getOrCreate())

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## 1. 1962 movies

List the films where the **yr** is 1962 [Show **id, title**]

In [2]:
movie = sc.read.table('sqlzoo.movie')
actor = sc.read.table('sqlzoo.actor')
casting = sc.read.table('sqlzoo.casting')

In [3]:
movie.filter(movie['yr']==1962).select('id', 'title').toPandas()

                                                                                

Unnamed: 0,id,title
0,10212,A Kind of Loving
1,10329,A Symposium on Popular Songs
2,10347,A Very Private Affair (Vie Privée)
3,10648,An Autumn Afternoon
4,10868,Atraco a las tres
...,...,...
81,21324,Two Half Times in Hell
82,21462,Varan the Unbelievable
83,21494,Village of Daughters
84,21673,What Ever Happened to Baby Jane?


## 2. When was Citizen Kane released?

Give year of 'Citizen Kane'.

In [4]:
movie.filter(movie['title']=='Citizen Kane').select('yr').toPandas()

Unnamed: 0,yr
0,1941


## 3. Star Trek movies

List all of the Star Trek movies, including the **id**, **title** and **yr** (all of these movies include the words Star Trek in the title). Order results by year.

In [5]:
from pyspark.sql.functions import col, isnull, lower
# Case-insensitive match on the title, sorted by release year as the question asks
(movie.filter(lower(movie['title']).contains('star trek'))
      .select('id', 'title', 'yr')
      .orderBy('yr')
      .toPandas())

Unnamed: 0,id,title,yr
0,17772,Star Trek: The Motion Picture,1979
1,17775,Star Trek II: The Wrath of Khan,1982
2,17776,Star Trek III: The Search for Spock,1984
3,17777,Star Trek IV: The Voyage Home,1986
4,17779,Star Trek V: The Final Frontier,1989
5,17774,Star Trek Generations,1994
6,17770,Star Trek: First Contact,1996
7,17771,Star Trek: Insurrection,1998
8,17778,Star Trek Nemesis,2002
9,17773,Star Trek,2009


## 4. id for actor Glenn Close

What **id** number does the actor 'Glenn Close' have?

In [6]:
actor.filter(actor['name']=='Glenn Close').select('id').toPandas()

Unnamed: 0,id
0,140


## 5. id for Casablanca

What is the **id** of the film 'Casablanca'?

In [7]:
movie.filter(movie['title']=='Casablanca').select('id').toPandas()

Unnamed: 0,id
0,11768


## 6. Cast list for Casablanca

Obtain the cast list for 'Casablanca'.

> _what is a cast list?_  
> The cast list is the names of the actors who were in the movie.

Use **movieid=11768** (or whatever value you got from the previous question).

In [8]:
(casting.join(actor, casting['actorid']==actor['id'])
 .filter("movieid==11768")
 .select('name').toPandas())

Unnamed: 0,name
0,Peter Lorre
1,John Qualen
2,Madeleine LeBeau
3,Jack Benny
4,Dan Seymour
5,Norma Varden
6,Ingrid Bergman
7,Conrad Veidt
8,Leon Belasco
9,Humphrey Bogart
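
For comparison, the same cast-list lookup can be sketched as a plain SQL join. The block below is only an illustration: it uses Python's built-in `sqlite3` with a handful of made-up rows (the ids and the names `Actor A`, `Actor B`, `Actor C` are hypothetical, not taken from the sqlzoo data), and the helper `cast_list` is a name introduced here.

```python
import sqlite3

# Hypothetical toy rows; ids and names are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER, ord INTEGER);
    INSERT INTO actor   VALUES (1, 'Actor A'), (2, 'Actor B'), (3, 'Actor C');
    INSERT INTO casting VALUES (100, 1, 1), (100, 2, 2), (200, 3, 1);
""")

def cast_list(movieid):
    """Return the actor names credited on one movie, in billing order."""
    rows = conn.execute(
        """
        SELECT actor.name
        FROM casting
        JOIN actor ON casting.actorid = actor.id
        WHERE casting.movieid = ?
        ORDER BY casting.ord
        """,
        (movieid,),
    ).fetchall()
    return [name for (name,) in rows]

print(cast_list(100))
```

The join condition `casting.actorid = actor.id` plays the same role as the first argument to `casting.join(actor, ...)` in the cell above.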


## 7. Alien cast list

Obtain the cast list for the film 'Alien'.

In [9]:
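# Keep every casting row (right join), then attach actor details (left join).
# Dropping actor['id'] leaves the movie id as the only 'id' column in the result.
a = (movie.join(casting, movie['id']==casting['movieid'], how='right')
    .join(actor, col('actorid')==actor['id'], how='left')
    .drop(actor['id']))
a.filter("title=='Alien'").select('name').toPandas()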

Unnamed: 0,name
0,John Hurt
1,Sigourney Weaver
2,Yaphet Kotto
3,Harry Dean Stanton
4,Ian Holm
5,Tom Skerritt
6,Veronica Cartwright


## 8. Harrison Ford movies

List the films in which 'Harrison Ford' has appeared.

In [10]:
# a was obtained in #7
a.filter(a['name']=='Harrison Ford').select('title').toPandas()

Unnamed: 0,title
0,A Hundred and One Nights
1,Air Force One
2,American Graffiti
3,Apocalypse Now
4,Clear and Present Danger
5,Cowboys & Aliens
6,Crossing Over
7,Dead Heat on a Merry-Go-Round
8,Extraordinary Measures
9,Firewall


## 9. Harrison Ford as a supporting actor

List the films where 'Harrison Ford' has appeared - but not in the starring role. [Note: the ord field of casting gives the position of the actor. If ord=1 then this actor is in the starring role]

In [11]:
# a was obtained in #7
a.filter((a['name']=='Harrison Ford') & (a['ord']>1)).select('title').toPandas()

Unnamed: 0,title
0,A Hundred and One Nights
1,American Graffiti
2,Apocalypse Now
3,Cowboys & Aliens
4,Dead Heat on a Merry-Go-Round
5,Extraordinary Measures
6,Force 10 From Navarone
7,Hawthorne of the U.S.A.
8,Jimmy Hollywood
9,More American Graffiti


## 10. Lead actors in 1962 movies

List the films together with the leading star for all 1962 films.

In [12]:
# a was obtained in #7
a.filter((a['yr']==1962) & (a['ord']==1)).select('title', 'name').toPandas()

Unnamed: 0,title,name
0,A Kind of Loving,Alan Bates
1,A Symposium on Popular Songs,Paul Frees
2,A Very Private Affair (Vie Privée),Brigitte Bardot
3,An Autumn Afternoon,Chishu Ryu
4,Atraco a las tres,José Luis López Vázquez
...,...,...
79,Two Half Times in Hell,Imre Sinkovits
80,Varan the Unbelievable,Kôzô Nomura
81,Village of Daughters,Eric Sykes
82,What Ever Happened to Baby Jane?,Bette Davis


## 11. Busy years for Rock Hudson

Which were the busiest years for 'Rock Hudson'? Show the year and the number of movies he made each year, for any year in which he made more than 2 movies.

In [13]:
# a was obtained in #7
(a.filter(a['name']=='Rock Hudson')
    .select('yr', 'title')
    .groupBy('yr').count()
    .filter("count>2")
    .toPandas())

Unnamed: 0,yr,count
0,1961,3
1,1953,5
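
In SQL terms, the `groupBy` / `count` / `filter` chain above corresponds to `GROUP BY` with a `HAVING` clause. The sketch below shows that pattern with Python's built-in `sqlite3` on made-up rows; the table contents and the helper name `busy_years` are assumptions for illustration, not part of the sqlzoo data.

```python
import sqlite3

# Hypothetical toy data: actor 1 appears in three 1953 films and one 1954 film.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie   (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER, ord INTEGER);
    INSERT INTO movie   VALUES (1, 'F1', 1953), (2, 'F2', 1953),
                               (3, 'F3', 1953), (4, 'F4', 1954);
    INSERT INTO casting VALUES (1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 1);
""")

def busy_years(actorid, more_than=2):
    """Return (year, film count) pairs for years with more than `more_than` films."""
    return conn.execute(
        """
        SELECT movie.yr, COUNT(*) AS n
        FROM movie
        JOIN casting ON movie.id = casting.movieid
        WHERE casting.actorid = ?
        GROUP BY movie.yr
        HAVING COUNT(*) > ?
        ORDER BY movie.yr
        """,
        (actorid, more_than),
    ).fetchall()

print(busy_years(1))
```

`WHERE` filters rows before grouping, while `HAVING` filters the groups afterwards, which is why the count threshold belongs in `HAVING`.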


## 12. Lead actor in Julie Andrews movies

List the film title and the leading actor for all of the films 'Julie Andrews' played in.

> _Did you get "Little Miss Marker" twice?_   
> Julie Andrews starred in the 1980 remake of Little Miss Marker and not the original (1934).
>
> Title is not a unique field, so create a table of IDs in your subquery.

In [14]:
# a was obtained in #7
b = [x.movieid for x in a.filter(a['name']=='Julie Andrews').select('movieid').collect()]
a.filter((a['movieid'].isin(b)) & (a['ord']==1)).select('title', 'name').toPandas()

Unnamed: 0,title,name
0,10,Dudley Moore
1,Darling Lili,Julie Andrews
2,Despicable Me,Steve Carell
3,Duet for One,Julie Andrews
4,Hawaii,Julie Andrews
5,Little Miss Marker,Walter Matthau
6,Mary Poppins,Julie Andrews
7,Relative Values,Julie Andrews
8,Shrek the Third,Mike Myers
9,Star!,Julie Andrews
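
The hint about non-unique titles can be made concrete with a small experiment. The sketch below uses Python's built-in `sqlite3`; the ids are hypothetical and `Placeholder Lead` is a made-up stand-in for the 1934 film's lead, so only the shape of the result matters. The helper `leads` (a name introduced here) runs the same query twice, matching the subquery once on `title` and once on `id`.

```python
import sqlite3

# Toy rows: two films share a title; ids and 'Placeholder Lead' are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie   (id INTEGER PRIMARY KEY, title TEXT, yr INTEGER);
    CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER, ord INTEGER);
    INSERT INTO movie   VALUES (1, 'Little Miss Marker', 1934),
                               (2, 'Little Miss Marker', 1980);
    INSERT INTO actor   VALUES (1, 'Julie Andrews'), (2, 'Walter Matthau'),
                               (3, 'Placeholder Lead');
    INSERT INTO casting VALUES (1, 3, 1), (2, 2, 1), (2, 1, 2);
""")

LEADS_SQL = """
    SELECT movie.title, actor.name
    FROM movie
    JOIN casting ON casting.movieid = movie.id
    JOIN actor   ON actor.id = casting.actorid
    WHERE casting.ord = 1
      AND movie.{key} IN (
          SELECT m.{key}
          FROM movie m
          JOIN casting c ON c.movieid = m.id
          JOIN actor a   ON a.id = c.actorid
          WHERE a.name = ?
      )
    ORDER BY movie.id
"""

def leads(name, key):
    """Lead actors of the films `name` appeared in, matching films by `key`."""
    if key not in ("id", "title"):
        raise ValueError("key must be 'id' or 'title'")
    return conn.execute(LEADS_SQL.format(key=key), (name,)).fetchall()

print(leads("Julie Andrews", "title"))  # also picks up the 1934 film
print(leads("Julie Andrews", "id"))     # only the film she was actually in
```

Matching on `title` returns both films, while matching on `id` returns only the 1980 remake, which is the reason the cell above collects `movieid` values rather than titles.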


## 13. Actors with 15 leading roles

Obtain a list, in alphabetical order, of actors who've had at least 15 **starring** roles.

In [15]:
# a was obtained in #7
(a.filter(a['ord']==1)
    .groupBy('actorid', 'name')
    .count()
    .filter("count>=15")
    .select('name')
    .filter(~isnull(col('name')))
    .orderBy('name')
    .toPandas())
# a.filter(isnull(a.name)).toPandas()

Unnamed: 0,name
0,Adam Sandler
1,Al Pacino
2,Anthony Hopkins
3,Antonio Banderas
4,Arnold Schwarzenegger
...,...
95,Tommy Lee Jones
96,Tyrone Power
97,Walter Matthau
98,William Garwood


## 14. 1978 movies by cast size

List the films released in the year 1978 ordered by the number of actors in the cast, then by title.

In [16]:
# a was obtained in #7
(a.filter(a['yr']==1978)
    .groupBy('title', 'movieid')
    .count()
    .select('title', 'count')
    .orderBy(col('count').desc(), 'title')
    .toPandas())

Unnamed: 0,title,count
0,The Bad News Bears Go to Japan,50
1,The Swarm,37
2,Grease,28
3,American Hot Wax,27
4,The Boys from Brazil,26
...,...,...
99,Lies My Father Told Me,2
100,"Same Time, Next Year",2
101,Somebody Killed Her Husband,2
102,That's Carry On!,2
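
The two-level ordering used here (largest cast first, ties broken alphabetically by title) can be illustrated in plain Python with a composite sort key. This is a sketch on made-up rows; the function name `cast_sizes` and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical (movieid, title, actor) rows for a single year.
rows = [
    (1, "B film", "x"), (1, "B film", "y"),
    (2, "A film", "x"), (2, "A film", "z"),
    (3, "C film", "x"), (3, "C film", "y"), (3, "C film", "z"),
]

def cast_sizes(rows):
    """Count cast members per film, sorted by count (descending), then title."""
    # Group on (movieid, title) so two films sharing a title stay separate.
    counts = Counter((movieid, title) for movieid, title, _ in rows)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][1]))
    return [(title, n) for (_, title), n in ordered]

print(cast_sizes(rows))
```

Negating the count in the key sorts it in descending order while the title stays ascending, mirroring `orderBy(col('count').desc(), 'title')`.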


## 15. People who worked with Art Garfunkel

List all the people who have worked with 'Art Garfunkel'.

In [17]:
# a was obtained in #7
b = [x.movieid for x in a.filter(a['name']=='Art Garfunkel')
     .select('movieid').collect()]
(a.filter((a['movieid'].isin(b)) & (a['name'] != 'Art Garfunkel'))
 .select('name')
 .orderBy('name').toPandas())

Unnamed: 0,name
0,Beverly Johnson
1,Bill Paxton
2,Breckin Meyer
3,Bruce Jay Friedman
4,Cecilie Thomsen
5,Cindy Crawford
6,Donald Trump
7,Elio Fiorucci
8,Ellen Albertini Dow
9,Frederique van der Wal
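
An alternative to collecting the movie ids first is a self-join on the casting table: one copy finds the target actor's films, the other finds everyone else credited on those films. The sketch below shows the idea with Python's built-in `sqlite3`; all rows and the helper name `coworkers` are made up for illustration.

```python
import sqlite3

# Hypothetical toy rows; names and ids are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE casting (movieid INTEGER, actorid INTEGER);
    INSERT INTO actor   VALUES (1, 'Target'), (2, 'Co-star A'),
                               (3, 'Co-star B'), (4, 'Unrelated');
    INSERT INTO casting VALUES (10, 1), (10, 2),
                               (20, 1), (20, 3), (20, 2),
                               (30, 4);
""")

def coworkers(name):
    """Names of everyone who shares at least one film with `name`."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.name
        FROM actor   AS target
        JOIN casting AS c1    ON c1.actorid = target.id
        JOIN casting AS c2    ON c2.movieid = c1.movieid
                             AND c2.actorid <> target.id
        JOIN actor   AS other ON other.id = c2.actorid
        WHERE target.name = ?
        ORDER BY other.name
        """,
        (name,),
    ).fetchall()
    return [n for (n,) in rows]

print(coworkers("Target"))
```

`DISTINCT` removes the duplicates that arise when two people share more than one film.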


In [18]:
sc.stop()