# Real-World Data Wrangling

## 1. Gather data

In this section, we will extract data using at least two different data gathering methods and combine the data.

### **1.1.** Problem Statement

How do streaming-exclusive films compare with traditional theatrical releases in financial performance, audience reception, and content characteristics, and how have these metrics evolved over time once budgets and revenues are adjusted for inflation?

The film industry has shifted dramatically from theatrical dominance toward streaming platforms, yet it is unclear whether the two distribution channels actually differ in content characteristics, audience engagement, and financial viability.

This project merges TMDb theatrical movie data, IMDb ratings, four major streaming platform catalogs, and Federal Reserve CPI data to compare streaming-exclusive and theatrical releases. By adjusting all budgets and revenues to 2025 dollars using inflation data, I can fairly compare financial performance across six decades (1960-2025) and determine whether the "streaming revolution" represents truly different content or simply a new distribution channel for similar movies.

The primary data wrangling challenge involves fuzzy matching movies across datasets with inconsistent identifiers, requiring normalized title matching and year-based validation.

### **1.2.** Dataset Overview

This project combines four distinct data sources gathered using three different methods:

| Dataset | Source | Method | Size | Primary Use |
|---------|--------|--------|------|-------------|
| TMDb Movies | Kaggle | Manual Download | ~10.9K movies | Theatrical releases, financial data |
| IMDb Ratings | IMDb Datasets | Programmatic Download | ~50K movies (filtered) | Additional ratings, metadata validation |
| Streaming Platforms | Kaggle | Manual Download | ~23K titles | Streaming availability, exclusive content |
| CPI Inflation Data | FRED API | API Access | 65 years (1960-2025) | Inflation adjustment for financial analysis |

The combination of these datasets enables analysis of streaming versus theatrical releases while ensuring fair financial comparisons across six decades.

In [6]:
# Import statements for packages used in analysis
import httpx
import os
import zlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sb

from dotenv import load_dotenv
from urllib.parse import urlparse

## It's OK if this file doesn't exist: the API results are cached
## to a file in the project, and the API code checks for that cache
## before using the API key that this would load.
load_dotenv(dotenv_path=os.path.expanduser('~/.env'))

# Magic word for inline visualizations
%matplotlib inline

# Set default figure size for better readability
sb.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

In [53]:
## Data gathering and display functions
def download_and_save_file(url, local_filename, path="data/raw", download_if_exists=False):
    """
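    Download a file from a URL and save it locally.

    If the response is gzip-compressed, it is decompressed on the fly
    and written to disk as uncompressed data. HTTP, request, and I/O
    errors are printed rather than raised.

    Args:
        url (str): URL of the file to download.
        local_filename (str): Name of the file to save locally.
        path (str): Directory to save into. Defaults to "data/raw".
        download_if_exists (bool): If True, re-download even when the
            file already exists. Defaults to False.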
    """
    filepath = f"{path}/{local_filename}"
    
    if os.path.exists(filepath) and download_if_exists is False:
        print(f"{filepath} already exists. If you wish to re-download it, set download_if_exists=True")
        return
    
    try:
        with httpx.stream("GET", url, headers={"Accept-Encoding": "gzip"}) as response:
            # Raise an exception for bad status codes
            response.raise_for_status()

            headers = response.headers
            is_gzip = False
            
            print(f"Downloading {url} --> {filepath}")

            # Check if response is gzipped
            if 'content-encoding' in headers and 'gzip' in headers['content-encoding']:
                is_gzip = True

            elif 'content-type' in headers and 'binary/octet-stream' in headers['content-type']:                
                # AWS likes to send this back; just check the extension
                _, extension = os.path.splitext(urlparse(url).path)

                if extension == '.gz':
                    is_gzip = True

            # Decompress file if it's gzipped
            if is_gzip:
                print("Decompressing gzip content...")

                # wbits=MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
                decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
                download_size = 0
                count = 0
                next_report = 1_000_000

                with open(filepath, "wb") as outfile:
                    # iter_raw() yields the bytes as sent, without httpx decoding
                    # Content-Encoding first, so gzip is decompressed exactly once
                    for chunk in response.iter_raw():
                        count += 1
                        download_size += len(chunk)
                        outfile.write(decompressor.decompress(chunk))

                        # Report progress roughly once per megabyte downloaded
                        if download_size >= next_report:
                            print(f"Downloaded bytes: {download_size:,}")
                            next_report += 1_000_000

                    # Write any buffered output before the file is closed;
                    # reopening with "wb" afterwards would truncate the file
                    outfile.write(decompressor.flush())

                print(f"Downloaded {count:,} chunks for file size {download_size:,}")

            # If not gzipped, just write raw content to the output file
            else:
                with open(filepath, 'wb') as outfile:
                    for chunk in response.iter_bytes():
                        outfile.write(chunk)

            # Give the user feedback
            print(f"File '{local_filename}' downloaded successfully from '{url}'")

    # Check for various issues
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e}")
        
    except httpx.RequestError as e:
        print(f"An error occurred while requesting {url}: {e}")
        
    except IOError as e:
        print(f"Error saving file '{local_filename}': {e}")

def display_shape(df):
    """
    Output rows and columns of dataframe
    """
    print(f"\nRows: {df.shape[0]:,}")
    print(f"Columns: {df.shape[1]:,}\n")
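
The chunk-by-chunk decompression used in `download_and_save_file` can be checked without any network access. The sketch below is a minimal, standard-library-only illustration: `decompress_gzip_chunks` is a hypothetical helper (not part of the notebook), and the payload is toy data standing in for a downloaded `.gz` file.

```python
import gzip
import zlib

def decompress_gzip_chunks(chunks):
    """Decompress an iterable of gzip-compressed byte chunks into one bytes object."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    parts = [decompressor.decompress(chunk) for chunk in chunks]
    # flush() returns whatever is still buffered after the last chunk
    parts.append(decompressor.flush())
    return b"".join(parts)

# Toy payload standing in for a downloaded .gz file
original = b"tconst\taverageRating\tnumVotes\n" * 1000
compressed = gzip.compress(original)

# Split into small pieces to mimic network chunks of arbitrary size
chunks = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = decompress_gzip_chunks(chunks)
print(restored == original)
```

The key point is that the flushed remainder is appended to the same output as the decompressed chunks, rather than written separately.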

#### TMDb Movies Dataset

The TMDb (The Movie Database) dataset provides detailed information about 10,866 movies and forms the foundation of this analysis. It is particularly valuable because it includes both nominal and inflation-adjusted financial figures (budget_adj and revenue_adj, in 2010 dollars), extensive production metadata, and audience reception metrics. TMDb's community-driven model gives it broad coverage of theatrical releases, with data quality maintained through active curation.

Source URL: https://www.kaggle.com/datasets/juzershakir/tmdb-movies-dataset

Type: CSV (6.85 MB)

Method: Manual download from Kaggle

**Dataset variables:**
- id (int) - TMDb unique identifier
- imdb_id (string) - IMDb identifier for cross-dataset linking
- popularity (float) - TMDb popularity score
- budget (int) - Production budget in nominal dollars
- revenue (int) - Total box office revenue in nominal dollars
- original_title (string) - Original title
- cast (string) - Cast members (pipe-delimited)
- homepage (string) - Official movie homepage URL
- director (string) - Director name
- tagline (string) - Marketing catchphrase
- keywords (string) - Descriptive keywords (pipe-delimited)
- overview (string) - Plot synopsis
- runtime (int) - Duration in minutes
- genres (string) - Movie genres (pipe-delimited)
- production_companies (string) - Production companies (pipe-delimited)
- release_date (string) - Release date (format: M/D/YY, e.g., 6/9/15)
- vote_count (int) - Total number of TMDb user votes
- vote_average (float) - Average TMDb user rating (0-10 scale)
- release_year (int) - Release year
- budget_adj (float) - Budget adjusted to 2010 dollars
- revenue_adj (float) - Revenue adjusted to 2010 dollars

**Why this dataset:**
TMDb serves as the core theatrical release dataset because it combines financial data (budget/revenue), audience reception (ratings/votes), production details (cast/director/companies), and content characteristics (genres/runtime) in a single source. The existing adjustment to 2010 dollars already addresses the problem of comparing figures across time, though this project will re-adjust all figures to 2025 dollars for consistency. The imdb_id field enables precise linking to IMDb data, while the comprehensive metadata supports fuzzy matching to streaming platforms where IDs are absent.

**Significance of variables:**
- **id/imdb_id:** Primary keys for merging with IMDb and tracking across datasets
- **budget/revenue (and _adj versions):** Core financial metrics for profitability analysis; will be re-adjusted to 2025 dollars
- **vote_average/vote_count:** TMDb's rating system for comparison with IMDb's averageRating/numVotes
- **genres/runtime:** Content characteristics for analyzing streaming vs theatrical patterns
- **cast/director:** Enable talent crossover analysis between distribution channels
- **release_year/release_date:** Temporal analysis and fuzzy matching keys
- **popularity:** TMDb-specific engagement metric not available in other datasets
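
The sample output for this dataset shows `release_date` values such as `6/9/15`, with a two-digit year. Python's `%y` convention maps 00-68 to 2000-2068, so parsing a 1960s date directly could land it in the wrong century. One possible workaround, sketched here on toy rows (the second and third rows are made up for illustration), is to take the month and day from the string and the year from `release_year`:

```python
import pandas as pd

# Toy rows in the same M/D/YY shape as the TMDb sample output
movies = pd.DataFrame({
    "release_date": ["6/9/15", "10/5/62", "12/15/99"],
    "release_year": [2015, 1962, 1999],
})

# Split "M/D/YY" into integer columns 0 (month), 1 (day), 2 (two-digit year)
parts = movies["release_date"].str.split("/", expand=True).astype(int)

# Build the full date from the unambiguous four-digit release_year
movies["release_date_parsed"] = pd.to_datetime(
    pd.DataFrame({
        "year": movies["release_year"],
        "month": parts[0],
        "day": parts[1],
    })
)
print(movies)
```

This assumes `release_year` is always populated and agrees with the two-digit year in `release_date`; that would need to be verified during assessment.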

In [54]:
# Load dataset
tmdb_movies_raw_df = pd.read_csv('data/raw/tmdb_movies_data.csv')

display_shape(tmdb_movies_raw_df)
display(tmdb_movies_raw_df.head())


Rows: 10,866
Columns: 21



Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


#### IMDb Non-Commercial Datasets

IMDb provides free access to subsets of its data for non-commercial use, refreshed daily. This dataset extends the TMDb data with additional ratings and metadata, and supports validation of fuzzy matches when IMDb IDs are missing. IMDb's ratings come from a different user community than TMDb's, which enables a comparison of the two rating systems.

Source URL: https://datasets.imdbws.com/

Type: TSV (Tab-Separated Values), gzip-compressed

Method: Programmatic download using Python httpx library

**Files Used:**

**title.basics.tsv.gz** - Core movie metadata
   - tconst (string) - IMDb unique identifier (e.g., tt0111161)
   - titleType (string) - Format type (movie, short, tvSeries, etc.)
   - primaryTitle (string) - Popular title used in marketing
   - originalTitle (string) - Original language title
   - isAdult (boolean) - Adult content flag (0 or 1)
   - startYear (YYYY) - Release year
   - endYear (YYYY) - End year for TV series; `\N` for all other title types
   - runtimeMinutes (int) - Duration in minutes
   - genres (string array) - Up to three genres (comma-separated)

**title.ratings.tsv.gz** - Community ratings
   - tconst (string) - IMDb unique identifier
   - averageRating (float) - Weighted average rating (1-10 scale)
   - numVotes (int) - Total number of votes

**Why this dataset:**
IMDb's rating system is well established and widely used, providing an alternative measure of audience reception. The tconst identifier enables precise linking to TMDb via the imdb_id field, while the comprehensive coverage helps identify theatrical releases that may be missing from TMDb. Because the files are refreshed daily, the data stays current.

**Significance of variables:**
- **tconst/averageRating/numVotes:** Enable comparison of IMDb vs TMDb rating systems and vote engagement
- **titleType:** Filters dataset to movies only (excluding TV series, shorts)
- **startYear/runtimeMinutes:** Provide additional data points for movies missing this info in TMDb
- **genres:** Cross-validation of genre classifications across systems
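
As a rough sketch of how these variables might be used together, the example below filters to `titleType == "movie"` and attaches ratings by `tconst`. It runs on tiny in-memory stand-ins rather than the real files: the `Carmencita` row mirrors the sample output, while the two `tt9000...` rows are invented for illustration. Passing `na_values` converts IMDb's `\N` sentinel to `NaN` at load time, which is an optional choice and not what the loading cells in this notebook do.

```python
import io
import pandas as pd

# Tiny stand-ins for title.basics.tsv and title.ratings.tsv
basics_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t1\n"
    "tt9000001\tmovie\tExample Feature\t2015\t124\n"
    "tt9000002\tmovie\tUndated Feature\t\\N\t\\N\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t5.7\t2187\n"
    "tt9000001\t6.5\t5562\n"
)

# na_values turns the literal "\N" sentinel into NaN
basics = pd.read_csv(io.StringIO(basics_tsv), sep="\t", na_values="\\N")
ratings = pd.read_csv(io.StringIO(ratings_tsv), sep="\t", na_values="\\N")

# Keep feature films only, then attach ratings where they exist
movies = basics[basics["titleType"] == "movie"]
movies_rated = movies.merge(ratings, on="tconst", how="left")
print(movies_rated)
```

A left merge keeps movies without ratings, so the share of unrated titles can be measured before deciding whether to drop them.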

In [55]:
# Get files via URL download
URL = "https://datasets.imdbws.com/title.basics.tsv.gz"
LOCAL_FILE = "imdb_title_basics.tsv"

download_and_save_file(URL, LOCAL_FILE)

# Load dataset - note that this dataset is quite large at almost 1GB on disk
imdb_title_basics_raw_df = pd.read_csv(f'data/raw/{LOCAL_FILE}', sep='\t')

display_shape(imdb_title_basics_raw_df)
display(imdb_title_basics_raw_df.head())

data/raw/imdb_title_basics.tsv already exists. If you wish to re-download it, set download_if_exists=True

Rows: 12,116,115
Columns: 9



Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short


In [56]:
# Get files via URL download
URL = "https://datasets.imdbws.com/title.ratings.tsv.gz"
LOCAL_FILE="imdb_title_ratings.tsv"

download_and_save_file(URL, LOCAL_FILE)

# Load dataset
imdb_title_ratings_raw_df = pd.read_csv(f'data/raw/{LOCAL_FILE}', sep='\t')

display_shape(imdb_title_ratings_raw_df)
display(imdb_title_ratings_raw_df.head())

data/raw/imdb_title_ratings.tsv already exists. If you wish to re-download it, set download_if_exists=True

Rows: 1,607,373
Columns: 3



Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2187
1,tt0000002,5.5,307
2,tt0000003,6.4,2273
3,tt0000004,5.2,196
4,tt0000005,6.2,3012


#### Streaming Platform Catalogs

Four major streaming platform datasets cover the content available on Netflix, Hulu, Disney+, and Amazon Prime as of 2021. These datasets enable identification of streaming-exclusive content (never released theatrically) versus theatrical films later added to streaming platforms. The consistent schema across platforms facilitates comparison of content strategies.

**Datasets:**
- Netflix: https://www.kaggle.com/datasets/shivamb/netflix-shows (~8.8K titles)
- Hulu: https://www.kaggle.com/datasets/shivamb/hulu-movies-and-tv-shows (~3.1K titles)
- Disney+: https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows (~1.5K titles)
- Amazon Prime: https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows (~9.7K titles)

Type: CSV

Method: Manual download from Kaggle (all four datasets)

**Shared Schema:**
- show_id (string) - Unique identifier per platform
- type (string) - "Movie" or "TV Show"
- title (string) - Content title
- director (string) - Director name(s)
- cast (string) - Cast members (comma-separated)
- country (string) - Production country
- date_added (string) - Date added to platform (e.g., "September 25, 2021")
- release_year (int) - Original release year
- rating (string) - Content rating (PG-13, TV-MA, etc.)
- duration (string) - Runtime (e.g., "90 min" or "2 Seasons")
- listed_in (string) - Genres (comma-separated)
- description (string) - Synopsis

**Why these datasets:**
Streaming platforms represent the industry's future, but comprehensive availability data isn't accessible via APIs without commercial partnerships. These Kaggle datasets provide a snapshot of four major platforms' catalogs at a single point in time (2021), enabling analysis of platform content strategies. The lack of consistent identifiers (no TMDb/IMDb IDs) creates an authentic data wrangling challenge requiring fuzzy matching.

**Significance of variables:**
- **type:** Filters to movies only (excluding TV shows)
- **title/release_year:** Primary keys for fuzzy matching to theatrical databases
- **director/cast:** Enable cross-validation of matches and analysis of talent migration to streaming
- **date_added:** Indicates when content became available (may differ from release_year)
- **listed_in:** Genre analysis comparing streaming vs theatrical content preferences
- **duration:** Identifies potential data quality issues (TV shows mis-labeled as movies)
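
Because the streaming catalogs carry no TMDb or IMDb IDs, matching has to rely on `title` and `release_year`. The sketch below shows one possible normalization scheme using only the standard library. `normalize_title` and `is_match` are hypothetical helpers, and the specific rules (stripping accents, dropping a leading article, allowing a one-year tolerance) are assumptions to be tuned against the real data. Note that this is exact matching on a normalized key, not similarity scoring.

```python
import re
import unicodedata

def normalize_title(title):
    """Reduce a title to a lowercase ASCII key for cross-dataset matching."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace("&", " and ")
    # Replace punctuation with spaces, then collapse runs of whitespace
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Assumption: a leading article carries no matching value
    return re.sub(r"^(the|a|an) ", "", text)

def is_match(title_a, year_a, title_b, year_b, year_tolerance=1):
    """Treat two records as the same film if keys agree and years are close."""
    same_key = normalize_title(title_a) == normalize_title(title_b)
    return same_key and abs(year_a - year_b) <= year_tolerance

print(normalize_title("Mad Max: Fury Road"))
print(is_match("The Marksman", 2021, "Marksman", 2020))
print(is_match("Silent Night", 2020, "Silent Night", 2012))
```

The year check is what separates remakes and unrelated films that share a title. Titles that still fail to match after normalization could be passed to a similarity measure such as `difflib.SequenceMatcher`, at the cost of more false positives.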

In [57]:
# I decided to manually load these datasets into the project.
# Working with the kaggle and kagglehub Python packages is more trouble than it's worth.

amazon_prime_titles_raw_df = pd.read_csv('data/raw/amazon_prime_titles.csv')
display_shape(amazon_prime_titles_raw_df)
display(amazon_prime_titles_raw_df.head())


Rows: 9,668
Columns: 12



Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...


In [58]:
disney_plus_titles_raw_df = pd.read_csv('data/raw/disney_plus_titles.csv')
display_shape(disney_plus_titles_raw_df)
display(disney_plus_titles_raw_df.head())


Rows: 1,450
Columns: 12



Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",,"November 26, 2021",2016,TV-G,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",,"November 26, 2021",1988,PG,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,TV-G,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",,"November 26, 2021",2021,TV-PG,41 min,Musical,"This is real life, not just fantasy!"
4,s5,TV Show,The Beatles: Get Back,,"John Lennon, Paul McCartney, George Harrison, ...",,"November 25, 2021",2021,,1 Season,"Docuseries, Historical, Music",A three-part documentary from Peter Jackson ca...


In [59]:
hulu_titles_raw_df = pd.read_csv('data/raw/hulu_titles.csv')
display_shape(hulu_titles_raw_df)
display(hulu_titles_raw_df.head())


Rows: 3,073
Columns: 12



Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Ricky Velez: Here's Everything,,,,"October 24, 2021",2021,TV-MA,,"Comedy, Stand Up",​Comedian Ricky Velez bares it all with his ho...
1,s2,Movie,Silent Night,,,,"October 23, 2021",2020,,94 min,"Crime, Drama, Thriller","Mark, a low end South London hitman recently r..."
2,s3,Movie,The Marksman,,,,"October 23, 2021",2021,PG-13,108 min,"Action, Thriller",A hardened Arizona rancher tries to protect an...
3,s4,Movie,Gaia,,,,"October 22, 2021",2021,R,97 min,Horror,A forest ranger and two survivalists with a cu...
4,s5,Movie,Settlers,,,,"October 22, 2021",2021,,104 min,"Science Fiction, Thriller",Mankind's earliest settlers on the Martian fro...


In [60]:
netflix_titles_raw_df = pd.read_csv('data/raw/netflix_titles.csv')
display_shape(netflix_titles_raw_df)
display(netflix_titles_raw_df.head())


Rows: 8,807
Columns: 12



Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


#### Consumer Price Index (CPI) for Inflation Adjustment

The Federal Reserve Economic Data (FRED) provides historical Consumer Price Index data necessary for adjusting movie budgets and revenues to constant 2025 dollars. Without inflation adjustment, comparing a 1970s film budget to a 2020s budget is meaningless. The CPI-U (Consumer Price Index for All Urban Consumers) is the standard measure used by economists for this purpose.

Source URL: https://fred.stlouisfed.org/series/CPIAUCSL

Type: JSON (API response)

Method: API access via FRED API (programmatic)

**Dataset variables:**
- date (string) - Observation date (YYYY-MM-DD format)
- value (string) - CPI index value as returned by the API (1982-84 = 100 baseline); "." marks a missing observation
- year (int) - Year extracted from date (derived)
- cpi (float) - Annual average CPI value, converted from value (derived)

**Why this dataset:**
Inflation has been substantial over the 65-year span of this analysis (1960-2025). A \$1 million budget in 1960 equals approximately \$10.4 million in 2025 dollars. Without adjustment, financial comparisons would be skewed toward recent years, making modern films appear far more expensive than historical films when they may actually be comparable in real terms. The FRED API provides authoritative, regularly updated data sourced from the U.S. Bureau of Labor Statistics.

**Significance of variables:**
- **year/cpi:** Used to calculate inflation multiplier: `adjusted_amount = original * (CPI_2025 / CPI_year)`
- **Annual averaging:** Smooths monthly volatility for more stable year-over-year comparisons

**Adjustment formula:**
```
budget_adj_2025 = budget_original × (CPI_2025 / CPI_release_year)
revenue_adj_2025 = revenue_original × (CPI_2025 / CPI_release_year)
```
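
The formula above could be applied with a merge on release year, as in the following sketch. The 1960 and 1964 CPI values match the `cpi_df` preview in this notebook, but the 2025 value (`300.0`) is a made-up placeholder, and the movie rows and the `adjusted` dataframe are illustrative only.

```python
import pandas as pd

# 1960 and 1964 come from the CPI preview; 300.0 for 2025 is a placeholder
cpi = pd.DataFrame({"year": [1960, 1964, 2025], "cpi": [29.585, 31.038, 300.0]})

# Hypothetical movies with nominal budget and revenue
movies = pd.DataFrame({
    "original_title": ["Example A", "Example B"],
    "release_year": [1960, 1964],
    "budget": [1_000_000, 2_000_000],
    "revenue": [5_000_000, 0],
})

TARGET_YEAR = 2025
target_cpi = cpi.loc[cpi["year"] == TARGET_YEAR, "cpi"].iloc[0]

# Attach each movie's release-year CPI; a left merge keeps movies
# whose release year has no CPI row (their adjusted values become NaN)
adjusted = movies.merge(cpi, left_on="release_year", right_on="year", how="left")
multiplier = target_cpi / adjusted["cpi"]
adjusted["budget_adj_2025"] = adjusted["budget"] * multiplier
adjusted["revenue_adj_2025"] = adjusted["revenue"] * multiplier

print(adjusted[["original_title", "budget_adj_2025", "revenue_adj_2025"]])
```

A zero stays zero under this formula, so if zeros stand in for missing budgets or revenues they should be handled before the adjusted figures are interpreted.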

In [43]:
def get_cpi_data(start_year=1960, end_year=2025, local_filename='cpi_data.json', path="data/raw", download_if_exists=False):
    """
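    Get annual CPI data from the FRED API.

    Uses series CPIAUCSL, the CPI-U (Consumer Price Index for All
    Urban Consumers). Results are cached to disk as JSON and the
    cache is reused unless download_if_exists=True. The API key is
    read from the FRED_API environment variable.

    Returns a DataFrame with columns year and cpi, or None if the
    request fails.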
    """
    
    filepath = f"{path}/{local_filename}"
    
    # Get cached data if we have it
    if os.path.exists(filepath) and download_if_exists is False:
        print(f"{filepath} already exists. If you wish to re-download it, set download_if_exists=True")
        return pd.read_json(filepath)
    
    url = 'https://api.stlouisfed.org/fred/series/observations'
    params = {
        'series_id': 'CPIAUCSL',
        'api_key': os.getenv('FRED_API'),
        'file_type': 'json',
        'observation_start': f'{start_year}-01-01',
        'observation_end': f'{end_year}-12-31',
        'frequency': 'a'
    }
    
    try:
        # Get the data as JSON
        response = httpx.get(url, params=params)
        response.raise_for_status()

        data = response.json()

        # Create dataframe
        cpi_df = pd.DataFrame(data['observations'])
        
        # Remove invalid data rows
        cpi_df = cpi_df[cpi_df['value'] != "."]
        
        # Transform data
        cpi_df['year'] = pd.to_datetime(cpi_df['date']).dt.year
        cpi_df['cpi'] = pd.to_numeric(cpi_df['value'])
        
        cpi_df = cpi_df[['year', 'cpi']]
        
        # Cache transformed data to disk
        cpi_df.to_json(filepath, orient='records')

        return cpi_df
        
    except httpx.HTTPStatusError as e:
        print(f"Error during request: {e}")
        
    except httpx.RequestError as e:
        print(f"An error occurred while requesting: {e}")

In [61]:
cpi_df = get_cpi_data()
display_shape(cpi_df)
display(cpi_df.head())

data/raw/cpi_data.json already exists. If you wish to re-download it, set download_if_exists=True

Rows: 65
Columns: 2



Unnamed: 0,year,cpi
0,1960,29.585
1,1961,29.902
2,1962,30.253
3,1963,30.633
4,1964,31.038


## 2. Assess data

Assess the data according to data quality and tidiness metrics using the report below.

List **two** data quality issues and **two** tidiness issues. Assess each data issue visually **and** programmatically, then briefly describe the issue you find.  **Make sure you include justifications for the methods you use for the assessment.**

### Quality Issue 1:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Quality Issue 2:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Tidiness Issue 1:

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

### Tidiness Issue 2: 

In [None]:
#FILL IN - Inspecting the dataframe visually

In [None]:
#FILL IN - Inspecting the dataframe programmatically

Issue and justification: *FILL IN*

## 3. Clean data
Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. **Make sure you include justifications for your cleaning decisions.**

After cleaning each issue, use **either** the visual or the programmatic method to validate that the cleaning was successful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have **at least** 4 variables after combining the data.

In [None]:
# FILL IN - Make copies of the datasets to ensure the raw dataframes 
# are not impacted

### **Quality Issue 1: FILL IN**

In [None]:
# FILL IN - Apply the cleaning strategy

In [None]:
# FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Quality Issue 2: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 1: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 2: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also perform the combination before the cleaning steps.

In [None]:
#FILL IN - Remove unnecessary variables and combine datasets

## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data is saved to your database/data store
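
One way to satisfy these requirements is to write each dataset to stage-specific folders with the stage in the filename. The sketch below is only an illustration: `save_versions` is a hypothetical helper, the folder layout is an assumption, and a temporary directory stands in for the project's data store.

```python
import os
import tempfile
import pandas as pd

def save_versions(raw_df, clean_df, name, base_dir):
    """Write raw and cleaned copies to separate folders and return both paths."""
    paths = {}
    for stage, df in (("raw", raw_df), ("clean", clean_df)):
        folder = os.path.join(base_dir, stage)
        os.makedirs(folder, exist_ok=True)
        # The stage is part of the filename so versions are never confused
        paths[stage] = os.path.join(folder, f"{name}_{stage}.csv")
        df.to_csv(paths[stage], index=False)
    return paths

# Toy data: the cleaned version drops the row with a missing title
raw = pd.DataFrame({"title": ["Example A", None], "release_year": [2015, 2020]})
clean = raw.dropna(subset=["title"])

with tempfile.TemporaryDirectory() as tmp:
    paths = save_versions(raw, clean, "tmdb_movies", tmp)
    reloaded = pd.read_csv(paths["clean"])
    print(os.path.basename(paths["clean"]), len(reloaded))
```

Reloading the cleaned file is a quick check that the saved version round-trips as expected.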

In [None]:
#FILL IN - saving data

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* FILL IN from answer to Step 1

In [None]:
#Visual 1 - FILL IN

*Answer to research question:* FILL IN

In [None]:
#Visual 2 - FILL IN

*Answer to research question:* FILL IN

### **5.2:** Reflection
In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?

*Answer:* FILL IN