Analysis of IMDB Movies Dataset

Dataset: https://datasets.imdbws.com/
Goal: Uncover hidden patterns, trends, and anomalies in the movie dataset in order to answer some interesting questions.

Data Mining Process

Obtain Relevant Data -> Contained in the dataset folder

Use 5 different datasets, originally .tsv files, converted into .csv format. Most of the datasets contain a primary key feature called 'tconst', which is a unique identifier for the title. This makes it possible to merge and join the datasets, increasing the number of features available per title and allowing me to obtain more relevant data.
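
As a rough illustration of joining on 'tconst', here is a minimal sketch using only the Python standard library. The sample rows and the `inner_join` helper are hypothetical, and the notebooks may well use a dataframe library instead. The sketch assumes the right-hand table has at most one row per key, which holds for a ratings-style table.

```python
# Hypothetical sample rows; the real files have many more rows and columns.
basics = [
    {"tconst": "tt0000001", "primaryTitle": "Title A", "startYear": "1999"},
    {"tconst": "tt0000002", "primaryTitle": "Title B", "startYear": "2005"},
    {"tconst": "tt0000003", "primaryTitle": "Title C", "startYear": "2010"},
]
ratings = [
    {"tconst": "tt0000001", "averageRating": "8.1", "numVotes": "1200"},
    {"tconst": "tt0000003", "averageRating": "6.4", "numVotes": "300"},
]


def inner_join(left, right, key="tconst"):
    """Return rows whose key appears in both tables, with columns combined."""
    # Index the right table by key (assumes one row per key on the right).
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


merged = inner_join(basics, ratings)
for row in merged:
    print(row["tconst"], row["primaryTitle"], row["averageRating"])
```

Titles without a rating are dropped by an inner join; a left join would keep them with missing rating fields, which may be preferable depending on the question being asked.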

Data Preparation -> DataPreparation.ipynb

Convert the .tsv files into .csv files. Explore the data, then cleanse it to fix incorrect, incomplete, or duplicate records. Next, integrate any datasets that share features to create unified sets of information for analysis. Finally, remove all instances flagged by the cleansing step.
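
The conversion and a basic cleansing pass could look something like the sketch below. It works on in-memory text so it runs as-is; the `tsv_to_csv` helper and the sample rows are hypothetical, and treating '\N' as a missing value and dropping exact duplicate rows are assumptions about what the cleansing step does.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert TSV text to CSV text, blanking '\\N' and dropping duplicate rows."""
    # QUOTE_NONE keeps quote characters inside titles as literal text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    seen = set()
    for row in reader:
        # Replace the missing-value placeholder with an empty cell.
        cleaned = tuple("" if cell == r"\N" else cell for cell in row)
        if cleaned in seen:
            continue  # skip exact duplicates
        seen.add(cleaned)
        writer.writerow(cleaned)
    return out.getvalue()


# Hypothetical sample with one duplicate row and one missing endYear.
sample_tsv = (
    "tconst\tstartYear\tendYear\n"
    "tt0000001\t1999\t\\N\n"
    "tt0000001\t1999\t\\N\n"
    "tt0000002\t2005\t2008\n"
)
csv_text = tsv_to_csv(sample_tsv)
print(csv_text)
```

For the real files, the same function could be fed the contents of an opened .tsv file, though holding every row in a set may use a lot of memory on the larger datasets.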

Data Visualization -> DataVisualization.ipynb

Use various tools to plot the data as charts, graphs, and maps. Visualization helps find outliers and hidden patterns in the data and is essential for large datasets.
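
Outliers that stand out in a box plot can also be flagged numerically. The sketch below applies the common 1.5 x IQR rule to a hypothetical list of runtimes; the rule, the `iqr_outliers` helper, and the sample values are illustrative assumptions, not necessarily what the notebook does.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical runtimeMinutes values; 600 is an obvious anomaly.
runtimes = [88, 92, 95, 101, 104, 110, 118, 125, 600]
outliers = iqr_outliers(runtimes)
print(outliers)
```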

Data Analysis and Interpretation -> StatisticalAnalysis.ipynb

Examine the data through statistical analysis. Find the mode, entropy, median, mean, etc. for various feature types. For example, we can determine the average number of births per year, from 1900 to the present, for movie stars.
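
A minimal sketch of these summary statistics, using the standard library on hypothetical values. Entropy here is Shannon entropy in bits over the observed category frequencies, which is one reasonable reading of "entropy" for a categorical feature. The births-per-year average counts only years that appear in the sample.

```python
import math
import statistics
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical samples of a numeric and a categorical feature.
birth_years = [1950, 1950, 1962, 1975, 1975, 1975, 1988]
genres = ["Drama", "Drama", "Comedy", "Action"]

summary = {
    "mean": statistics.mean(birth_years),
    "median": statistics.median(birth_years),
    "mode": statistics.mode(birth_years),
    "genre_entropy_bits": shannon_entropy(genres),
}

# Number of people born in each year, then the average over observed years.
births_per_year = Counter(birth_years)
avg_births_per_year = sum(births_per_year.values()) / len(births_per_year)

print(summary)
print(avg_births_per_year)
```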

Data Mining -> Model.ipynb

Choose appropriate mining techniques and implement ML algorithms to solve a problem. Train the model on the clean data obtained from the previous steps. Compare model accuracy and fine-tune the hyperparameters.
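
The train, compare, and tune loop can be sketched with a deliberately tiny stand-in model: a one-split decision rule (a decision stump) whose threshold plays the role of the hyperparameter. This is not the model in Model.ipynb. The data, the candidate thresholds, and the helper names are all hypothetical, and a real workflow would also hold out a validation set for tuning.

```python
# Hypothetical (numVotes, is_good) pairs; is_good = 1 means averageRating >= 8.
train = [(50, 0), (120, 0), (900, 0), (1500, 1), (4000, 1), (9000, 1)]
test = [(80, 0), (2000, 1), (700, 1), (5000, 1)]


def predict(threshold, num_votes):
    """Decision stump: predict 'good' when numVotes reaches the threshold."""
    return 1 if num_votes >= threshold else 0


def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(threshold, x) == y for x, y in rows)
    return hits / len(rows)


# Tune the single hyperparameter on the training data.
candidates = [100, 1000, 5000]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Report accuracy on data the tuning step never saw.
test_accuracy = accuracy(best_threshold, test)
print(best_threshold, test_accuracy)
```

The gap between the training accuracy of the chosen threshold and its test accuracy is the kind of comparison that indicates whether the tuned model generalizes.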

Datasets

Analyzed 5 datasets from https://www.imdb.com/interfaces/ with the following feature information:

title.akas.tsv.gz - General movie title information | 8 Features:
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) - a number to uniquely identify rows for a given titleId
- title (string) - the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) - 0: not original title; 1: original title
title.basics.tsv.gz - Contains the following information for titles | 9 Features:
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)
- primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) - TV Series end year. '\N' for all other title types
- runtimeMinutes (integer) - primary runtime of the title, in minutes
- genres (string array) - includes up to three genres associated with the title
title.crew.tsv.gz - Contains the director and writer information for all the titles in IMDb | 3 Features:
- tconst (string) - unique identifier of the title
- directors (array) - director(s) of the given title
- writers (array) - writer(s) of the given title
title.ratings.tsv.gz - The IMDb rating and votes for titles | 3 Features:
- tconst (string) - unique identifier of the title
- averageRating (double) - weighted average of all the individual user ratings
- numVotes (integer) - number of votes the title has received
name.basics.tsv.gz - Contains the following information for names | 6 Features:
- nconst (string) - unique identifier of the person
- primaryName (string) - name by which the person is most often credited
- birthYear - in YYYY format
- deathYear - in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) - the top-3 professions of the person
- knownForTitles (array of tconsts) - titles the person is known for
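
When reading raw rows, the '\N' placeholder and the array-typed features need a little handling. The sketch below assumes array features are stored as comma-separated strings; the `parse_value` helper and the sample row are hypothetical.

```python
def parse_value(raw, is_array=False):
    r"""Map the '\N' placeholder to None (or []) and split comma-separated arrays."""
    if raw == r"\N":
        return [] if is_array else None
    return raw.split(",") if is_array else raw


# Hypothetical title.basics-style row with a missing endYear.
row = {"tconst": "tt0000001", "endYear": r"\N", "genres": "Comedy,Drama"}
parsed = {
    "tconst": parse_value(row["tconst"]),
    "endYear": parse_value(row["endYear"]),
    "genres": parse_value(row["genres"], is_array=True),
}
print(parsed)
```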

Some questions

What genres are people most interested in?
What makes a good title (average rating >=8)?
Are senior actors more popular than junior actors?
Are senior actors better than junior actors?
What makes a title popular (we can use numVotes to determine this)?
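
As a starting point for the genre and "good title" questions, the sketch below sums numVotes per genre as a proxy for interest and flags titles with an average rating of at least 8. The sample titles are hypothetical, and counting a multi-genre title's votes toward each of its genres is an assumption about how interest should be attributed.

```python
from collections import defaultdict

# Hypothetical merged rows with genres already split into lists.
titles = [
    {"genres": ["Drama"], "averageRating": 8.3, "numVotes": 5000},
    {"genres": ["Comedy", "Drama"], "averageRating": 6.9, "numVotes": 1200},
    {"genres": ["Action"], "averageRating": 8.0, "numVotes": 800},
]

# Total votes per genre; a title counts toward every genre it lists.
votes_by_genre = defaultdict(int)
for title in titles:
    for genre in title["genres"]:
        votes_by_genre[genre] += title["numVotes"]

most_voted_genre = max(votes_by_genre, key=votes_by_genre.get)

# "Good" titles as defined above: average rating >= 8.
good_titles = [t for t in titles if t["averageRating"] >= 8]

print(most_voted_genre, len(good_titles))
```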

harisahmadg/IMDB-Movies-Data-Analysis
