## Project 3 (API Data Collection)

## Name : Bader Abanmi 
## Kaggle Link : https://www.kaggle.com/isbader/movie-dataset

### Problem Statement

**The movie industry is one of the largest entertainment industries of our time. An estimated 500,000 movies have been produced to date, and each one takes a lot of time to make.
I will use The Movie Database (TMDB) API to build a dataset that helps me understand movie features and what contributes to a movie
being successful.**

|Feature|Type|Description|
|---|---|---|
|name|object|Original title of the movie|
|budget|int64|Movie budget|
|release_date|object|Release date of the movie|
|vote_average|float64|Average vote, from 0 to 10|
|adult|bool|Whether the movie is for adults|
|id|int64|Movie ID in the TMDB API|
|language|object|Original language code, e.g. en = English|
|popularity|float64|Movie popularity|
|revenue|int64|Revenue of movie in one year|
|runtime|float64|Movie runtime in minutes|
|status|object|Movie status, e.g. Released|
|tagline|object|Short tagline used when marketing the movie|
|title|object|Movie title, which may differ from the original title (e.g. a translation)|
|vote_count|int64|Number of people who voted|
|First_genre|object|First genre of the movie|
|Second_genre|object|Second genre of the movie|
|Third_genre|object|Third genre of the movie|
|collection_name|object|Collection name if the movie is part of a collection|

### Imports

In [1]:
import requests
import numpy as np
import pandas as pd

#### Understanding the API

In [2]:
data = requests.get("https://api.themoviedb.org/3/movie/9281?api_key=018dc56c083a046302ece29b691755ad").json() 
data #Check the API Data

{'adult': False,
 'backdrop_path': '/QHlIf8JgLDOMpEOeuraSKTTjkJ.jpg',
 'belongs_to_collection': None,
 'budget': 12000000,
 'genres': [{'id': 80, 'name': 'Crime'},
  {'id': 18, 'name': 'Drama'},
  {'id': 10749, 'name': 'Romance'},
  {'id': 53, 'name': 'Thriller'}],
 'homepage': None,
 'id': 9281,
 'imdb_id': 'tt0090329',
 'original_language': 'en',
 'original_title': 'Witness',
 'overview': "A sheltered Amish child is the sole witness of a brutal murder in a restroom at a Philadelphia train station, and he must be protected.  The assignment falls to a taciturn detective who goes undercover in a Pennsylvania Dutch community. On the farm, he slowly assimilates despite his urban grit and forges a romantic bond with the child's beautiful mother.",
 'popularity': 10.245,
 'poster_path': '/pQZa314NJP3ieMAj6CgPI1v7nUY.jpg',
 'production_companies': [{'id': 4,
   'logo_path': '/fycMZt242LVjagMByZOLUGbCvv3.png',
   'name': 'Paramount',
   'origin_country': 'US'}],
 'production_countries': [{'is

**This TMDB endpoint returns one movie per request, identified by its movie ID**

In [3]:
data.keys() #The API Provides these features for each movie

dict_keys(['adult', 'backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id', 'imdb_id', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'video', 'vote_average', 'vote_count'])

In [4]:
data = requests.get("https://api.themoviedb.org/3/movie/{}?api_key=018dc56c083a046302ece29b691755ad".format(989)).json()
print(data['id'],data['title']) #Testing the fetching mechanism

989 The Mortal Storm
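
The request pattern used in the cells above could be wrapped in a small helper. What follows is a minimal sketch rather than the notebook's own code: the name `fetch_movie`, the `TMDB_API_KEY` environment variable, and the injectable `get` parameter are assumptions introduced for illustration. It calls the same endpoint, sends the key as a query parameter, and returns `None` when the response has no `id` field.

```python
import os

BASE_URL = "https://api.themoviedb.org/3/movie/{}"  # same endpoint as the cells above


def fetch_movie(movie_id, api_key=None, get=None, timeout=10):
    """Return the movie's JSON as a dict, or None if the response has no 'id'.

    `get` is a hypothetical hook (it defaults to requests.get) so the
    function can be exercised without network access.
    """
    if api_key is None:
        # Assumption: the key is stored in an environment variable
        # instead of being written into the URL.
        api_key = os.environ.get("TMDB_API_KEY", "")
    if get is None:
        import requests  # imported lazily; only needed for real requests
        get = requests.get
    response = get(BASE_URL.format(movie_id), params={"api_key": api_key}, timeout=timeout)
    data = response.json()
    # Same availability test as the notebook: a usable response has an 'id' key.
    return data if "id" in data else None
```

With a real key in the environment, `fetch_movie(989)` would perform the same request as cell 4. Reading the key from the environment also keeps it out of the saved notebook.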


In [5]:
#This function checks whether the movie ID is available (to avoid a KeyError)
def check_ava(counter):
    data = requests.get("https://api.themoviedb.org/3/movie/{}?api_key=018dc56c083a046302ece29b691755ad".format(counter)).json()
    return 'id' in data #A response for an available movie contains the 'id' key

#Create lists to append the needed information

name = []
genres = []
budget = []
release_date = []
vote_average = []
adult = []
belongs_to_collection = []
movie_id = []
original_language = []
popularity = []
revenue = []
runtime = []
status = []
tagline = []
title = []
vote_count = []

# This function loops over a range of movie IDs and retrieves the necessary data
def import_movies(start,finish):
    #The function takes two parameters: where the search starts and where it ends (finish itself is excluded)
    for i in range(start,finish):
        #Request the movie with ID i from the API
        data = requests.get("https://api.themoviedb.org/3/movie/{}?api_key=018dc56c083a046302ece29b691755ad".format(i)).json()

        if 'id' in data: #Same test as check_ava, applied to the response already fetched so each ID is requested only once
            #Append the information to the lists created above
            name.append(data['original_title'])
            genres.append(data['genres'])
            budget.append(data['budget'])
            release_date.append(data['release_date'])
            vote_average.append(data['vote_average'])
            adult.append(data['adult'])
            belongs_to_collection.append(data['belongs_to_collection'])
            movie_id.append(data['id'])
            original_language.append(data['original_language'])
            popularity.append(data['popularity'])
            #production_countries.append(data['production_countries'])
            revenue.append(data['revenue'])
            runtime.append(data['runtime'])
            status.append(data['status'])
            tagline.append(data['tagline'])
            title.append(data['title'])
            vote_count.append(data['vote_count'])



In [6]:
import_movies(1,17) #Use the import_movies function to import movies from the API
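
An alternative to keeping sixteen parallel lists in sync is to build one dictionary per movie and collect those. The sketch below is an illustration only: `FIELDS` and `movie_to_row` are hypothetical names, and the column names are taken from the `movie_data` dictionary used in this notebook. A missing key becomes `None` instead of raising a `KeyError`.

```python
# Fields kept by import_movies, mapped from API key to output column name.
FIELDS = {
    "original_title": "name",
    "genres": "genres",
    "budget": "budget",
    "release_date": "release_date",
    "vote_average": "vote_average",
    "adult": "adult",
    "belongs_to_collection": "collection",
    "id": "id",
    "original_language": "language",
    "popularity": "popularity",
    "revenue": "revenue",
    "runtime": "runtime",
    "status": "status",
    "tagline": "tagline",
    "title": "title",
    "vote_count": "vote_count",
}


def movie_to_row(data):
    """Pick the fields of interest from one API response dict."""
    # dict.get returns None for a missing key instead of raising KeyError.
    return {column: data.get(key) for key, column in FIELDS.items()}


rows = []
sample = {"id": 989, "title": "The Mortal Storm", "original_title": "The Mortal Storm"}
rows.append(movie_to_row(sample))
print(rows[0]["name"], rows[0]["budget"])  # The Mortal Storm None
```

A list of such dictionaries can be passed directly to `pd.DataFrame(rows)`, so every row is guaranteed to have the same columns.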

In [7]:
#Build a dictionary from the lists, convert it to a DataFrame, and save it
movie_data = {'name':name,
              'genres':genres,
              'budget':budget,
              'release_date':release_date,
              'vote_average':vote_average,
              'adult':adult,
              'collection':belongs_to_collection,
              'id':movie_id,
              'language':original_language,
              'popularity':popularity,
              'revenue':revenue,
              'runtime':runtime,
              'status':status,
              'tagline':tagline,
              'title':title,
              'vote_count':vote_count}

movies = pd.DataFrame(movie_data)

movies.to_csv('movie_data_?_?.csv', index = False) 
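
The `?_?` in the file name appears to stand for the ID range of each batch, since the files loaded in the Data Cleaning section are named like `movie_data_1000_5000.csv`. As a hedged sketch, the names could be generated from the range itself. `batch_filename` and `batch_ranges` are hypothetical helpers, and the batch size of 2000 is an arbitrary choice for the example.

```python
def batch_filename(start, finish, folder="Fetched_Data"):
    """Build a file name that records the ID range of one batch."""
    return f"{folder}/movie_data_{start}_{finish}.csv"


def batch_ranges(start, stop, size):
    """Yield (start, finish) pairs that cover [start, stop) without overlap."""
    for lo in range(start, stop, size):
        yield lo, min(lo + size, stop)


for lo, hi in batch_ranges(1000, 7000, 2000):
    # Each pair can be passed to import_movies(lo, hi) before saving.
    print(batch_filename(lo, hi))
```

Because `range(start, finish)` excludes `finish`, consecutive batches that share a boundary value do not fetch the same ID twice.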




In [8]:
movies

Unnamed: 0,name,genres,budget,release_date,vote_average,adult,collection,id,language,popularity,revenue,runtime,status,tagline,title,vote_count
0,Ariel,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",0,1988-10-21,6.9,False,,2,fi,8.848,0,73,Released,,Ariel,86
1,Varjoja paratiisissa,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",0,1986-10-17,7.4,False,,3,fi,9.169,0,72,Released,,Shadows in Paradise,97
2,Four Rooms,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",4000000,1995-12-09,6.1,False,,5,en,11.547,4300000,98,Released,Twelve outrageous guests. Four scandalous requ...,Four Rooms,1342
3,Judgment Night,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",21000000,1993-10-15,6.6,False,,6,en,8.485,12136938,110,Released,Don't move. Don't whisper. Don't even breathe.,Judgment Night,128
4,Life in Loops (A Megacities RMX),"[{'id': 99, 'name': 'Documentary'}]",42000,2006-01-01,7.4,False,,8,en,1.458,0,80,Released,A Megacities remix.,Life in Loops (A Megacities RMX),9
5,Sonntag im August,"[{'id': 18, 'name': 'Drama'}]",0,2004-09-02,6.7,False,,9,de,0.808,0,15,Released,,Sunday in August,6
6,Star Wars,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",11000000,1977-05-25,8.2,False,"{'id': 10, 'name': 'Star Wars Collection', 'po...",11,en,52.923,775398007,121,Released,"A long time ago in a galaxy far, far away...",Star Wars,12268
7,Finding Nemo,"[{'id': 16, 'name': 'Animation'}, {'id': 10751...",94000000,2003-05-30,7.8,False,"{'id': 137697, 'name': 'Finding Nemo Collectio...",12,en,21.367,940335536,100,Released,There are 3.7 trillion fish in the ocean. They...,Finding Nemo,12496
8,Forrest Gump,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",55000000,1994-07-06,8.4,False,,13,en,27.955,677945399,142,Released,Life is like a box of chocolates...you never k...,Forrest Gump,16137
9,American Beauty,"[{'id': 18, 'name': 'Drama'}]",15000000,1999-09-15,8.0,False,,14,en,21.979,356296601,122,Released,Look closer.,American Beauty,7072


### Data Cleaning 

In [10]:
data1 = pd.read_csv('Fetched_Data/movie_data_1000_5000.csv') #The data was fetched and saved in small batches to account for fetching delays

In [11]:
data2 = pd.read_csv('Fetched_Data/movie_data_5000_7000.csv')

In [12]:
data3 = pd.read_csv('Fetched_Data/movie_data_7000_10000.csv')

In [13]:
data4 = pd.read_csv('Fetched_Data/movie_data_10000_12000.csv')

In [14]:
data5 = pd.read_csv('Fetched_Data/movie_data_12000_15000.csv')

In [15]:
data6 = pd.read_csv('Fetched_Data/movie_data_15000_17000.csv')

In [16]:
data = pd.concat([data1,data2,data3,data4,data5,data6]) #Concat the data into one dataframe
data.shape

(9910, 16)

In [17]:
data.drop_duplicates(subset=None, keep='first', inplace=True) #Drop the duplicates

In [18]:
data.columns

Index(['name', 'genres', 'budget', 'release_date', 'vote_average', 'adult',
       'collection', 'id', 'language', 'popularity', 'revenue', 'runtime',
       'status', 'tagline', 'title', 'vote_count'],
      dtype='object')

In [19]:
data.shape

(8603, 16)

In [21]:
data.head(5)

Unnamed: 0,name,genres,budget,release_date,vote_average,adult,collection,id,language,popularity,revenue,runtime,status,tagline,title,vote_count
0,Mulholland Drive,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",15000000,2001-09-08,7.8,False,,1018,en,16.046,20117339,147.0,Released,An actress longing to be a star. A woman searc...,Mulholland Drive,3192
1,Adams æbler,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",3700000,2005-04-15,7.5,False,,1023,da,8.238,0,94.0,Released,"When it rains, it pours",Adam's Apples,324
2,Heavenly Creatures,"[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name...",5000000,1994-09-12,7.1,False,,1024,en,11.453,3049135,109.0,Released,Not all angels are innocent.,Heavenly Creatures,503
3,Die Siebtelbauern,"[{'id': 18, 'name': 'Drama'}]",0,1998-06-19,7.3,False,,1039,de,1.919,0,95.0,Released,The rich get richer... and sometimes the poor ...,The Inheritors,9
4,Il gattopardo,"[{'id': 18, 'name': 'Drama'}]",0,1963-03-28,7.7,False,,1040,it,9.634,0,186.0,Released,Luchino Visconti's enduring romantic adventure,The Leopard,317


In [20]:
# The genres are stored as a string, so we parse the string and keep up to three genres per movie.
import ast # We import ast to evaluate the string and turn it into a list

# We make three lists to contain the three genres
First_genre = []
Second_genre = []
Third_genre = []

for i in data['genres']: #Loop over the genres column and append the values to the lists we created
    genre_list = ast.literal_eval(i) #Parse the string once per row
    if len(genre_list) == 3: #Three genres: fill all three lists
        First_genre.append(genre_list[0]['name'])
        Second_genre.append(genre_list[1]['name'])
        Third_genre.append(genre_list[2]['name'])
    elif len(genre_list) == 2: #Two genres: the third genre is None
        First_genre.append(genre_list[0]['name'])
        Second_genre.append(genre_list[1]['name'])
        Third_genre.append(None)
    elif len(genre_list) == 1: #One genre: the second and third genres are None
        First_genre.append(genre_list[0]['name'])
        Second_genre.append(None)
        Third_genre.append(None)
    else: #No genres, or more than three genres: append None to all three lists
        First_genre.append(None)
        Second_genre.append(None)
        Third_genre.append(None)
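
Note that the loop above leaves all three genre columns empty for a movie with more than three genres, because such rows fall into the final `else` branch. If the intent is to keep the first three genres in that case, one possible variant is sketched below. `first_three_genres` is a hypothetical helper, not part of the original notebook, and the sample string reuses the four genres of the movie shown in the API response earlier.

```python
import ast


def first_three_genres(raw):
    """Return a 3-tuple of genre names, padded with None when fewer exist."""
    # Non-string values (e.g. NaN from an empty CSV cell) are treated as no genres.
    genres = ast.literal_eval(raw) if isinstance(raw, str) else []
    names = [g["name"] for g in genres[:3]]  # keep at most the first three
    return tuple(names + [None] * (3 - len(names)))


raw = ("[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, "
       "{'id': 10749, 'name': 'Romance'}, {'id': 53, 'name': 'Thriller'}]")
print(first_three_genres(raw))  # ('Crime', 'Drama', 'Romance')
```

Applied to the whole column, the three values per row could be unpacked with `zip(*data['genres'].map(first_three_genres))`. This changes the results compared with the loop above, so the outputs shown in this notebook would no longer match.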

In [22]:
collection_name = []
for i in data['collection']: #Loop over the collection column and grab the collection name
    if isinstance(i, str): #Missing collections are read back from the CSV as NaN (a float), not a string
        collection_name.append(ast.literal_eval(i)['name'])
    else:
        collection_name.append(None)
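
The same extraction can be written as a function that also tolerates malformed cells. This is a sketch under assumptions: `collection_name_of` is a hypothetical name, and returning `None` for unparseable values is a design choice made here, not something the notebook does.

```python
import ast


def collection_name_of(value):
    """Return the collection's name, or None if the cell is empty or malformed."""
    if not isinstance(value, str):
        return None  # NaN (float) marks a movie without a collection
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None  # the cell did not contain a valid Python literal
    return parsed.get("name") if isinstance(parsed, dict) else None


print(collection_name_of("{'id': 10, 'name': 'Star Wars Collection'}"))  # Star Wars Collection
```

It could then replace the loop with `data['collection'].map(collection_name_of)`.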
        
        

In [23]:
# Here we are adding the collection name to our dataframe
data['collection_name'] = collection_name
# Here we are adding the genres to our DataFrame
data['First_genre'] = First_genre
data['Second_genre'] = Second_genre
data['Third_genre'] = Third_genre

In [24]:
data.head(40)

Unnamed: 0,name,genres,budget,release_date,vote_average,adult,collection,id,language,popularity,revenue,runtime,status,tagline,title,vote_count,collection_name,First_genre,Second_genre,Third_genre
0,Mulholland Drive,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",15000000,2001-09-08,7.8,False,,1018,en,16.046,20117339,147.0,Released,An actress longing to be a star. A woman searc...,Mulholland Drive,3192,,Thriller,Drama,Mystery
1,Adams æbler,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",3700000,2005-04-15,7.5,False,,1023,da,8.238,0,94.0,Released,"When it rains, it pours",Adam's Apples,324,,Drama,Comedy,Crime
2,Heavenly Creatures,"[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name...",5000000,1994-09-12,7.1,False,,1024,en,11.453,3049135,109.0,Released,Not all angels are innocent.,Heavenly Creatures,503,,Drama,Fantasy,
3,Die Siebtelbauern,"[{'id': 18, 'name': 'Drama'}]",0,1998-06-19,7.3,False,,1039,de,1.919,0,95.0,Released,The rich get richer... and sometimes the poor ...,The Inheritors,9,,Drama,,
4,Il gattopardo,"[{'id': 18, 'name': 'Drama'}]",0,1963-03-28,7.7,False,,1040,it,9.634,0,186.0,Released,Luchino Visconti's enduring romantic adventure,The Leopard,317,,Drama,,
5,Sommersby,"[{'id': 18, 'name': 'Drama'}, {'id': 53, 'name...",0,1993-02-05,6.3,False,,1049,en,9.177,140081992,109.0,Released,She knew his face. His touch. His voice. She k...,Sommersby,175,,,,
6,洗澡,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",0,1999-09-13,7.2,False,,1050,zh,2.519,0,92.0,Released,,Shower,29,,Comedy,Drama,
7,The French Connection,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",1800000,1971-10-09,7.6,False,"{'id': 155474, 'name': 'French Connection Coll...",1051,en,16.423,41158757,104.0,Released,There are no rules and no holds barred when Po...,The French Connection,819,French Connection Collection,Action,Crime,Thriller
8,Blow-Up,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",1800000,1966-12-18,7.5,False,,1052,en,12.83,0,111.0,Released,Michelangelo Antonioni's first British film,Blow-Up,555,,Drama,Mystery,Thriller
9,Breathless,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",0,1983-05-13,5.9,False,,1058,en,9.396,19910002,100.0,Released,He's the last man on earth any woman needs - b...,Breathless,108,,,,


In [25]:
#Deleting unnecessary columns
del data['genres']
del data['collection']

In [28]:
data.head()
data.dtypes

name                object
budget               int64
release_date        object
vote_average       float64
adult                 bool
id                   int64
language            object
popularity         float64
revenue              int64
runtime            float64
status              object
tagline             object
title               object
vote_count           int64
collection_name     object
First_genre         object
Second_genre        object
Third_genre         object
dtype: object

In [33]:
data.to_csv('Project3_data.csv', index = False) 

In [34]:
data.shape

(8603, 18)