# Gathering Data

This notebook extracts and cleans data from the **IMDb datasets**. The data is divided into folders, one for each year from 1960 to 2024. Each folder has 3 datasets: imdb_movies.csv, advanced_movies_details and merged_movies_data. In this analysis, only **merged_movies_data** is used. 

### Importing Libraries

In [157]:
import numpy as np
import pandas as pd
import os
import re

### Defining Variables

In [158]:
root_dir = "Data"
files_starts = "merged_movies_data_"
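Before reading anything, it can help to list which files match the prefix. The snippet below is a minimal sketch, assuming the layout described above: one subfolder per year directly under `Data`, each holding a file whose name starts with `merged_movies_data_`. The helper name `find_merged_files` is hypothetical and is not part of the notebook.

```python
from pathlib import Path

root_dir = "Data"
files_starts = "merged_movies_data_"

def find_merged_files(root, prefix):
    """Return sorted paths of files in root/<subfolder>/ whose name starts with prefix.

    Hypothetical helper: assumes the files sit exactly one folder level below root.
    """
    return sorted(p for p in Path(root).glob(f"*/{prefix}*") if p.is_file())

# Print every matching file; prints nothing if the folder is missing or empty
for path in find_merged_files(root_dir, files_starts):
    print(path)
```

Sorting the paths makes the processing order reproducible, which `os.listdir` alone does not guarantee.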

### Extracting and Aggregating

In [None]:
##Collecting one dataframe per file; they are concatenated once after the loop
frames = []

##Getting folders in root
for folder in os.listdir(root_dir):
    folder_path = os.path.join(root_dir, folder)

    ##Accessing each folder
    if os.path.isdir(folder_path):
        for file in os.listdir(folder_path):
            ##Selecting the file of interest
            if file.startswith(files_starts):
                file_path = os.path.join(folder_path, file)

                ##Reading CSV
                data = pd.read_csv(file_path)

                ##Cleaning the Movie Link column (dropping the trailing "/?ref_=..." part)
                data['Movie Link'] = data['Movie Link'].apply(lambda x: re.sub(r'/\?ref_=.*$', '', str(x)))

                ##Extracting ID field from movie link (digits of the last path segment)
                data['id'] = data['Movie Link'].apply(lambda x: x.split('/')[-1] if '/' in x else None)
                ##Guarding against None so links without a "/" do not raise a TypeError
                data['id'] = data['id'].apply(lambda x: int(''.join(re.findall(r'\d+', x))) if x and re.findall(r'\d+', x) else None)

                ##Cleaning the Title column (dropping the leading "N. " ranking prefix)
                data['Title'] = data['Title'].apply(lambda x: re.sub(r'^\d+\.\s*', '', str(x)).strip())

                frames.append(data)

##Concatenating dataframes, ignoring index (empty dataframe if no file was found)
merged_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
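The three cleaning steps in the loop can be tried out on single values without loading any CSV. The snippet below is a minimal sketch: the function names are hypothetical, the sample link suffix `sr_t_1` is made up for illustration, and `extract_id` uses a single regex search on `/title/tt<digits>` rather than the split-and-join approach in the cell above.

```python
import re

def clean_link(link):
    """Drop the trailing '/?ref_=...' suffix from a movie link."""
    return re.sub(r'/\?ref_=.*$', '', str(link))

def extract_id(link):
    """Return the numeric part of the 'tt...' title ID as an int, or None if absent."""
    match = re.search(r'/title/tt(\d+)', clean_link(link))
    return int(match.group(1)) if match else None

def clean_title(title):
    """Remove a leading ranking prefix such as '12. ' from a title."""
    return re.sub(r'^\d+\.\s*', '', str(title)).strip()

# Hypothetical raw values, shaped like the ones the loop is written to handle
raw_link = "https://www.imdb.com/title/tt0054357/?ref_=sr_t_1"
raw_title = "1. Swiss Family Robinson"

print(clean_link(raw_link))    # https://www.imdb.com/title/tt0054357
print(extract_id(raw_link))    # 54357
print(clean_title(raw_title))  # Swiss Family Robinson
```

Converting the ID to an integer drops its leading zeros (`tt0054357` becomes `54357`), so the `Movie Link` column is still needed to recover the original identifier.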

In [160]:
##Reorganizing columns order to bring ID to first column
columns_order = ['id', 'Title'] + [col for col in merged_data.columns if col not in ['id', 'Title']]
merged_data = merged_data[columns_order]

In [161]:
##checking dataframe size
print(f"Merged data shape: {merged_data.shape}")

Merged data shape: (33600, 24)


The dataframe has 33,600 rows and 24 columns.

In [162]:
##Checking if there are duplicated IDs in the dataframe
duplicate_values = merged_data[merged_data['id'].duplicated()]
print(duplicate_values)

Empty DataFrame
Columns: [id, Title, Movie Link, Year, Duration, MPA, Rating, Votes, budget, grossWorldWide, gross_US_Canada, opening_weekend_Gross, directors, writers, stars, genres, countries_origin, filming_locations, production_companies, Languages, wins, nominations, oscars, release_date]
Index: []

[0 rows x 24 columns]


There are no duplicated IDs in the dataframe.
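The same check can be rehearsed on a small frame to see what a positive result looks like. The snippet below is a minimal sketch on made-up toy data (`toy` is hypothetical and stands in for `merged_data`). It uses `keep=False` so that every row sharing an ID is listed, not only the later repeats, and it also counts missing IDs, since rows where the ID could not be extracted would otherwise go unnoticed.

```python
import pandas as pd

# Hypothetical toy data standing in for merged_data; the first ID is repeated on purpose
toy = pd.DataFrame({
    "id": [54357, 54215, 54357],
    "Title": ["Swiss Family Robinson", "Psycho", "Swiss Family Robinson"],
})

# True only when no ID value appears more than once
ids_are_unique = toy["id"].is_unique

# keep=False marks every occurrence of a repeated ID, including the first one
repeated_rows = toy[toy["id"].duplicated(keep=False)]

# Number of rows with no ID at all
n_missing_ids = int(toy["id"].isna().sum())

print(ids_are_unique)
print(repeated_rows)
print(n_missing_ids)
```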

In [163]:
##displaying data
merged_data.head()

Unnamed: 0,id,Title,Movie Link,Year,Duration,MPA,Rating,Votes,budget,grossWorldWide,...,stars,genres,countries_origin,filming_locations,production_companies,Languages,wins,nominations,oscars,release_date
0,54357,Swiss Family Robinson,https://www.imdb.com/title/tt0054357,1960,2h 6m,Approved,7.1,19K,5000000.0,40357287.0,...,"['John Mills', 'Dorothy McGuire', 'James MacAr...","['Survival', 'Adventure', 'Family']",['United States'],"['Tobago, Trinidad and Tobago']",['Walt Disney Productions'],"['English', 'Malay']",0,0,0,1960.0
1,54215,Psycho,https://www.imdb.com/title/tt0054215,1960,1h 49m,R,8.5,741K,806947.0,32066835.0,...,"['Anthony Perkins', 'Janet Leigh', 'Vera Miles']","['Psychological Horror', 'Psychological Thrill...",['United States'],"['Psycho House and Bates Motel, Backlot Univer...","['Alfred J. Hitchcock Productions', 'Shamley P...",['English'],0,14,4,1960.0
2,53604,The Apartment,https://www.imdb.com/title/tt0053604,1960,2h 5m,Approved,8.3,204K,3000000.0,18778738.0,...,"['Jack Lemmon', 'Shirley MacLaine', 'Fred MacM...","['Farce', 'Holiday Comedy', 'Holiday Romance',...",['United States'],"['Majestic Theater, 247 West 44th Street, Manh...",['The Mirisch Corporation'],['English'],0,8,0,1960.0
3,54331,Spartacus,https://www.imdb.com/title/tt0054331,1960,3h 17m,PG-13,7.9,146K,12000000.0,1846975.0,...,"['Kirk Douglas', 'Laurence Olivier', 'Jean Sim...","['Adventure Epic', 'Historical Epic', 'Sword &...",['United States'],"['Hearst Castle, San Simeon, California, USA']",['Bryna Productions'],['English'],0,11,0,1960.0
4,53472,Breathless,https://www.imdb.com/title/tt0053472,1960,1h 30m,Not Rated,7.7,90K,400000.0,594039.0,...,"['Jean-Paul Belmondo', 'Jean Seberg', 'Van Dou...","['Caper', 'Crime', 'Drama']",['France'],"['11 rue Campagne Première, Paris 14, Paris, F...","['Les Films Impéria', 'Les Productions Georges...","['French', 'English']",0,4,1,1960.0


### Saving the new dataframe

The output is saved to a new CSV file to facilitate cleaning and further analysis.

In [166]:
os.makedirs('Final Dataset', exist_ok=True)  # make sure the output folder exists
merged_data.to_csv('Final Dataset/final_dataset.csv', index=False)