TMDB Movie Deep Clean Analysis

Overview

This repository provides a Python-based analysis pipeline (main.py) that performs deep cleaning, data validation, deduplication, and exploratory analysis on the TMDB movie dataset.

It is designed for stability, reproducibility, and reporting, combining 15 analysis scripts into a single sequential workflow.

Features

🔄 Downloads TMDB dataset automatically from GitHub
🧹 Cleans messy CSVs and fixes ambiguous fields
⚠️ Detects unusual characters and suspicious records
♻️ Removes duplicates and enforces correct data types
📊 Provides statistics, top movie lists, and genre summaries
💰 Calculates profits, top P&L, top companies, and revenue stats
📈 Visualizations:
- Popularity vs Profit (Top 100 movies)
- Budget vs Profit (All movies)
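
The deduplication and type-enforcement features could look roughly like the sketch below. It is a dependency-free illustration, not the code in main.py: the column names (`id`, `budget`, `revenue`) and the choice of `id` as the duplicate key are assumptions. With pandas, `DataFrame.drop_duplicates` covers the deduplication part.

```python
def to_int(value, default=0):
    """Coerce a CSV cell to int; fall back to `default` on blanks or junk."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return default


def clean_rows(rows, key="id", int_columns=("budget", "revenue")):
    """Drop repeated keys (keeping the first) and coerce numeric columns.

    `key` and `int_columns` are assumed names, not confirmed against main.py.
    """
    seen = set()
    cleaned = []
    for row in rows:
        if row[key] in seen:
            continue  # duplicate record: keep only the first occurrence
        seen.add(row[key])
        fixed = dict(row)  # copy so the caller's rows are left untouched
        for column in int_columns:
            fixed[column] = to_int(row.get(column))
        cleaned.append(fixed)
    return cleaned
```

Rows here are plain dictionaries, such as those produced by `csv.DictReader`.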

Project Structure

File	Description
`main.py`	Combined pipeline of all 15 scripts
`tmdb-movies.csv`	Raw TMDB dataset downloaded from GitHub
`movies-clean.csv`	Cleaned CSV after removing inconsistencies
`clean-data.csv`	Deduplicated CSV
`release_desc.csv`	Movies sorted by release date descending
`avg_rate.csv`	Movies with high ratings (≥7.5)
`Min_max.csv`	Min and Max revenue movies
`top10_profit.csv`	Top 10 movies by profit
`Dir-Act.csv`	Top director and actor
`genres.csv`	Movies by genre
`TopP&L.csv`	Top profitable and biggest loss movies
`TopCompany.csv`	Top production companies (count & total profit)
`unusual_characters_report.tsv`	Report on unusual characters per column
`suspicious_records.csv`	Records with short, numeric, or date-like titles
`tmdb_analysis.log`	Pipeline execution log
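
As an illustration of what ends up in `suspicious_records.csv`, the sketch below flags titles that are short, numeric, or date-like. The length threshold and both regular expressions are hypothetical; the rules in main.py may differ.

```python
import re

# Hypothetical rules; the thresholds and patterns in main.py may differ.
DATE_LIKE = re.compile(r"^\d{1,4}[/-]\d{1,2}[/-]\d{1,4}$")
NUMERIC = re.compile(r"^\d+([.,]\d+)?$")


def suspicious_reason(title, min_length=2):
    """Return why a title looks suspicious, or None if it looks fine."""
    text = (title or "").strip()
    if len(text) < min_length:
        return "short"
    if DATE_LIKE.match(text):
        return "date-like"
    if NUMERIC.match(text):
        return "numeric"
    return None
```

Flagging rather than deleting matters here, because some legitimate films have purely numeric titles.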

Release Date Boundaries

The release_date column is processed with year bounds 1900–2015:

1900: The lower bound, based on the start of modern movie history.
2015: The upper bound; the latest movie in the dataset is dated 31/12/2015.

These bounds keep the pipeline within realistic, dataset-relevant limits and prevent ambiguous two-digit years from being misinterpreted.
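
A minimal sketch of how such bounds can resolve two-digit years, using only the standard library. The `%m/%d/%y` layout is an assumption about the raw CSV, and the function name is illustrative, not taken from main.py.

```python
from datetime import datetime

MIN_YEAR, MAX_YEAR = 1900, 2015  # bounds stated in this README


def parse_release_date(text, fmt="%m/%d/%y"):
    """Parse a two-digit-year date and pull it back inside the bounds.

    `fmt` is an assumed layout; adjust it to match the raw CSV.
    Returns None when the value cannot be parsed or placed in range.
    """
    try:
        parsed = datetime.strptime(text.strip(), fmt)
        if parsed.year > MAX_YEAR:
            # strptime maps 00-68 to 2000-2068; past 2015 it must be 19xx
            parsed = parsed.replace(year=parsed.year - 100)
    except (AttributeError, ValueError):
        return None
    if not MIN_YEAR <= parsed.year <= MAX_YEAR:
        return None
    return parsed.date()
```

With this rule, `6/9/66` resolves to 1966 rather than 2066, while `6/9/15` stays in 2015.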

Installation

Clone the repository:

git clone https://github.com/ndlryan/TMDB-Movie-Deep-Clean-Analysis.git
cd TMDB-Movie-Deep-Clean-Analysis

Install dependencies:

pip install -r requirements.txt
# Or install manually: pip install pandas matplotlib numpy

Running the Analysis

Run the pipeline:

python main.py

This will:

Download the TMDB dataset
Generate unusual character report
Clean and deduplicate the dataset
Detect suspicious records
Analyze and sort movies (release date, rating, revenue, profit)
Summarize directors, actors, genres, and companies
Generate visualizations for popularity vs profit and budget vs profit
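
The steps above run one after another. A sequential runner with logging might be sketched as follows. This README does not say whether main.py continues or stops after a failed step, so the continue-on-failure behavior, the logger name, and the function name are all assumptions.

```python
import logging

logger = logging.getLogger("tmdb_analysis")


def run_pipeline(steps):
    """Run (name, callable) pairs in order and report which ones failed.

    A failing step is logged and skipped so later steps still run.
    """
    failed = []
    for name, step in steps:
        logger.info("Starting: %s", name)
        try:
            step()
        except Exception:
            # logger.exception records the traceback alongside the message
            logger.exception("Step failed: %s", name)
            failed.append(name)
        else:
            logger.info("Finished: %s", name)
    return failed
```

To send these messages to `tmdb_analysis.log`, call `logging.basicConfig(filename="tmdb_analysis.log", level=logging.INFO)` before running the steps.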

All output files will be saved in the project folder.

Visualization Insights

1. Popularity vs Profit (Top 100)

💡 Observation: High popularity does not guarantee high profit; audience attention does not always translate into box office success.

2. Budget vs Profit (All movies)

💡 Observation: Large budgets do not guarantee large profits; even high-budget films can lose money.
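
One way to put a number on this observation is to measure how many high-budget films lose money. The sketch below assumes profit is revenue minus budget and that each row carries numeric `budget` and `revenue` values; both function names are illustrative.

```python
def profit(row):
    """Profit as revenue minus budget (assumed definition)."""
    return row["revenue"] - row["budget"]


def loss_share(rows, min_budget):
    """Fraction of films with budget >= min_budget that lost money."""
    big = [row for row in rows if row["budget"] >= min_budget]
    if not big:
        return None  # no films at or above the threshold
    losers = sum(1 for row in big if profit(row) < 0)
    return losers / len(big)
```

A share well above zero for a high `min_budget` would support the observation; the threshold itself is a judgment call.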

Notes

Always ensure a stable internet connection to download the dataset.

CSV outputs are overwritten on each run; back them up first if you need to keep earlier results.
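
A small helper along these lines can copy the current outputs aside before a new run. The `backups` folder name and the file patterns are illustrative choices, not something main.py provides.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_outputs(project_dir=".", patterns=("*.csv", "*.tsv")):
    """Copy current outputs into backups/<timestamp>/ and return that folder.

    The `backups` folder name and the patterns are illustrative choices.
    """
    project = Path(project_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = project / "backups" / stamp
    target.mkdir(parents=True, exist_ok=True)
    for pattern in patterns:
        # glob is non-recursive, so earlier backups are not copied again
        for path in project.glob(pattern):
            shutil.copy2(path, target / path.name)
    return target
```

Calling `backup_outputs()` from the project folder before `python main.py` keeps a timestamped copy of the previous results.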

For much larger datasets, make sure the machine has enough memory, since pandas loads the data into memory for processing.

Author

Ryan
GitHub Profile

A complete, sequential, and fault-tolerant TMDB movie dataset analysis pipeline — combining cleaning, validation, exploration, and visualization.
