# Real-world Data Wrangling

In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data with two datasets of your choice.

You will retrieve and extract the data, assess it programmatically and visually across elements of data quality and structure, and implement a cleaning strategy. You will then store the updated data in your selected database or data store, combine the data, and answer a research question with the datasets.

Throughout the process, you are expected to:

1. Explain your decisions about the methods used for gathering, assessing, cleaning, storing, and answering the research question
2. Write code comments so your code is more readable

Before you start, install some of the required packages.

In [157]:
!python -m pip install kaggle==1.6.12



In [158]:
import os 
import shutil

In [159]:
# Define the path for the Kaggle credentials directory
kaggle_dir = os.path.expanduser('~/.kaggle')
# Create the directory if it does not already exist
os.makedirs(kaggle_dir, exist_ok=True)

In [160]:
# Paths for copying kaggle.json into the Kaggle directory (update source_path for your machine)
source_path = '/Users/leonardoflores/Downloads/kaggle.json'
destination_path = os.path.join(kaggle_dir, 'kaggle.json')

In [161]:
#Copying the kaggle.json to destination
shutil.copy(source_path, destination_path)

'/Users/leonardoflores/.kaggle/kaggle.json'

In [162]:
# Change permission to secure the file
os.chmod(destination_path, 0o600)

In [163]:
# Verify that the API worked
!kaggle datasets list

ref                                                           title                                            size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------------  ----------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
muhammadroshaanriaz/students-performance-dataset-cleaned      Students Performance | Clean Dataset             10KB  2024-10-29 19:32:26           9803        179  1.0              
ikynahidwin/depression-professional-dataset                   Depression Professional Dataset                  16KB  2024-11-20 06:39:18           1121         26  1.0              
steve1215rogg/student-lifestyle-dataset                       student lifestyle dataset                        22KB  2024-11-11 19:11:28           4608         68  1.0              
ikynahidwin/depression-student-dataset                        Depression Student Dataset  

In [164]:
import pandas as pd
import json

## 1. Gather data

In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.

### **1.1.** Problem Statement
In 2-4 sentences, explain the kind of problem you want to look at and the datasets you will be wrangling for this project.

*FILL IN:*

### **1.2.** Gather at least two datasets using two different data gathering methods

List of data gathering methods:

- Download data manually
- Programmatically downloading files
- Gather data by accessing APIs
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database

Each dataset must have at least two variables, and have greater than 500 data samples within each dataset.

For each dataset, briefly describe why you picked the dataset and the gathering method (2-3 full sentences), including the names and significance of the variables in the dataset. Show your work (e.g., if using an API to download the data, please include a snippet of your code). 

Load the dataset programmatically into this notebook.

#### **Dataset 1**

Type: CSV file

Method: The data was gathered through the Kaggle API (source: Kaggle.com).

Dataset variables:

*   *Variable 1 FILL IN* (e.g., H_MEAN: Mean hourly wage)
*   *Variable 2 FILL IN*

In [165]:
! kaggle datasets list -s 'TMDB'

ref                                                      title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
-------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
tmdb/tmdb-movie-metadata                                 TMDB 5000 Movie Dataset                               9MB  2017-09-28 01:09:12         392482       3880  0.8235294        
asaniczka/tmdb-movies-dataset-2023-930k-movies           Full TMDB Movies Dataset 2024 (1M Movies)           203MB  2024-11-24 09:38:29          13352        420  1.0              
juzershakir/tmdb-movies-dataset                          TMDb Movies Dataset                                   3MB  2018-03-24 04:18:19           6275         55  0.7058824        
ahsanaseer/top-rated-tmdb-movies-10k                     TMDB Movies Dataset                   

In [166]:
!kaggle datasets download -d 'tmdb/tmdb-movie-metadata'

Dataset URL: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
License(s): other
tmdb-movie-metadata.zip: Skipping, found more recently modified local copy (use --force to force download)


In [167]:
import zipfile
with zipfile.ZipFile('tmdb-movie-metadata.zip','r') as zip_ref:
    zip_ref.extractall('/Users/leonardoflores/Desktop/Project two')

In [168]:
# List all files and subdirectories in the extraction folder
location = '/Users/leonardoflores/Desktop/Project two'
unzipped_files = os.listdir(location)
for file in unzipped_files:
    print(file)

.DS_Store
data_wrangling_project_starter (1).ipynb
tmdb_5000_credits.csv
tmdb_5000_movies.csv
tmdb-movie-metadata.zip
.ipynb_checkpoints


In [169]:
dataframe_one = pd.read_csv('tmdb_5000_credits.csv')
dataframe_one.set_index('movie_id',inplace=True)
dataframe_one.head(10)

movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
559,Spider-Man 3,"[{""cast_id"": 30, ""character"": ""Peter Parker / ...","[{""credit_id"": ""52fe4252c3a36847f80151a5"", ""de..."
38757,Tangled,"[{""cast_id"": 34, ""character"": ""Flynn Rider (vo...","[{""credit_id"": ""52fe46db9251416c91062101"", ""de..."
99861,Avengers: Age of Ultron,"[{""cast_id"": 76, ""character"": ""Tony Stark / Ir...","[{""credit_id"": ""55d5f7d4c3a3683e7e0016eb"", ""de..."
767,Harry Potter and the Half-Blood Prince,"[{""cast_id"": 3, ""character"": ""Harry Potter"", ""...","[{""credit_id"": ""52fe4273c3a36847f801fab1"", ""de..."
209112,Batman v Superman: Dawn of Justice,"[{""cast_id"": 18, ""character"": ""Bruce Wayne / B...","[{""credit_id"": ""553bf23692514135c8002886"", ""de..."


In [170]:
dataframe_one.shape

(4803, 3)

#### **Dataset 2**

Type: CSV File.

Method: The data was gathered by manually downloading the file from Kaggle.com.


Dataset variables:

*   *Variable 1 FILL IN* (e.g., H_MEAN: Mean hourly wage)
*   *Variable 2 FILL IN*

In [171]:
#FILL IN 2nd data gathering and loading method
dataframe_two = pd.read_csv('tmdb_5000_movies.csv')
dataframe_two.set_index('id',inplace=True)
dataframe_two.head(10)


id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19995,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
285,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
206647,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
49026,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
49529,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
559,258000000,"[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 28, ""na...",http://www.sonypictures.com/movies/spider-man3/,"[{""id"": 851, ""name"": ""dual identity""}, {""id"": ...",en,Spider-Man 3,The seemingly invincible Spider-Man goes up ag...,115.699814,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-01,890871626,139.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,The battle within.,Spider-Man 3,5.9,3576
38757,260000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",http://disney.go.com/disneypictures/tangled/,"[{""id"": 1562, ""name"": ""hostage""}, {""id"": 2343,...",en,Tangled,When the kingdom's most wanted-and most charmi...,48.681969,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2010-11-24,591794936,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,They're taking adventure to new lengths.,Tangled,7.4,3330
99861,280000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://marvel.com/movies/movie/193/avengers_ag...,"[{""id"": 8828, ""name"": ""marvel comic""}, {""id"": ...",en,Avengers: Age of Ultron,When Tony Stark tries to jumpstart a dormant p...,134.279229,"[{""name"": ""Marvel Studios"", ""id"": 420}, {""name...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2015-04-22,1405403694,141.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Age Has Come.,Avengers: Age of Ultron,7.3,6767
767,250000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://harrypotter.warnerbros.com/harrypottera...,"[{""id"": 616, ""name"": ""witch""}, {""id"": 2343, ""n...",en,Harry Potter and the Half-Blood Prince,"As Harry begins his sixth year at Hogwarts, he...",98.885637,"[{""name"": ""Warner Bros."", ""id"": 6194}, {""name""...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2009-07-07,933959197,153.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Dark Secrets Revealed,Harry Potter and the Half-Blood Prince,7.4,5293
209112,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.batmanvsupermandawnofjustice.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 7002...",en,Batman v Superman: Dawn of Justice,Fearing the actions of a god-like Super Hero l...,155.790452,"[{""name"": ""DC Comics"", ""id"": 429}, {""name"": ""A...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2016-03-23,873260194,151.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Justice or revenge,Batman v Superman: Dawn of Justice,5.7,7004


In [172]:
dataframe_two.shape

(4803, 19)

Optional data storing step: You may save your raw dataset files to the local data store before moving to the next step.

In [173]:
#Optional: store the raw data in your local data store
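
One way to keep untouched copies of the raw data is to write each DataFrame to its own CSV file before any cleaning. The sketch below is only an illustration: the helper `save_raw`, the `raw_data` folder name, and the `_raw.csv` suffix are hypothetical choices, and a temporary directory stands in for the real project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_raw(df: pd.DataFrame, name: str, folder: Path) -> Path:
    """Write a DataFrame to <folder>/<name>_raw.csv and return the file path."""
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{name}_raw.csv"
    # Keep the index because it holds the movie identifier
    df.to_csv(target, index=True)
    return target


# Tiny stand-in for the credits table (assumption: one row is enough for the demo)
demo = pd.DataFrame({"title": ["Avatar"]}, index=pd.Index([19995], name="movie_id"))

# A temporary directory is used here; replace it with your own project folder
demo_folder = Path(tempfile.mkdtemp()) / "raw_data"
saved_path = save_raw(demo, "credits", demo_folder)

# Reload the file to confirm that the round trip preserves the index
reloaded = pd.read_csv(saved_path, index_col="movie_id")
print(reloaded)
```

In the notebook itself, the same helper could be called once for `dataframe_one` and once for `dataframe_two`.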

## 2. Assess data

Assess the data according to data quality and tidiness metrics using the report below.

List **two** data quality issues and **two** tidiness issues. Assess each data issue visually **and** programmatically, then briefly describe the issue you find.  **Make sure you include justifications for the methods you use for the assessment.**

### Quality Issue 1:

In [174]:
#FILL IN - Inspecting the dataframe visually
dataframe_one.head(10)

movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."
559,Spider-Man 3,"[{""cast_id"": 30, ""character"": ""Peter Parker / ...","[{""credit_id"": ""52fe4252c3a36847f80151a5"", ""de..."
38757,Tangled,"[{""cast_id"": 34, ""character"": ""Flynn Rider (vo...","[{""credit_id"": ""52fe46db9251416c91062101"", ""de..."
99861,Avengers: Age of Ultron,"[{""cast_id"": 76, ""character"": ""Tony Stark / Ir...","[{""credit_id"": ""55d5f7d4c3a3683e7e0016eb"", ""de..."
767,Harry Potter and the Half-Blood Prince,"[{""cast_id"": 3, ""character"": ""Harry Potter"", ""...","[{""credit_id"": ""52fe4273c3a36847f801fab1"", ""de..."
209112,Batman v Superman: Dawn of Justice,"[{""cast_id"": 18, ""character"": ""Bruce Wayne / B...","[{""credit_id"": ""553bf23692514135c8002886"", ""de..."


In [175]:
#FILL IN - Inspecting the dataframe programmatically
dataframe_one.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 19995 to 25975
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   4803 non-null   object
 1   cast    4803 non-null   object
 2   crew    4803 non-null   object
dtypes: object(3)
memory usage: 150.1+ KB


In [176]:
dataframe_one.isnull().sum()

title    0
cast     0
crew     0
dtype: int64

In [209]:
sum(dataframe_one.duplicated())

0

Issue and justification: 

The 'cast' and 'crew' columns contain JSON-like strings that hold nested data. The nested data appears to include important fields such as "cast_id" and "credit_id". Without parsing these JSON columns, I will not be able to answer my question to the fullest extent.
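
The parsing step mentioned above could look roughly like the following sketch. It runs on a two-row stand-in for the credits table rather than the full file, and the helper name `parse_json_column` is hypothetical. It assumes every cell is a valid JSON string, which should be verified on the real data.

```python
import json

import pandas as pd

# Miniature stand-in for the credits table; the real strings are much longer
credits_sample = pd.DataFrame(
    {
        "title": ["Avatar", "Running Forever"],
        "cast": [
            '[{"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"}]',
            "[]",
        ],
    },
    index=pd.Index([19995, 447027], name="movie_id"),
)


def parse_json_column(series: pd.Series) -> pd.Series:
    """Convert a column of JSON strings into Python lists of dicts."""
    return series.apply(json.loads)


credits_sample["cast"] = parse_json_column(credits_sample["cast"])

# Each cell is now a list, so nested fields can be reached directly
first_cast = credits_sample.loc[19995, "cast"]
print(type(first_cast), first_cast[0]["name"])
```

After parsing, an empty string list such as `"[]"` becomes an empty Python list, which makes movies without cast information easy to count.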

### Quality Issue 2:

In [177]:
#FILL IN - Inspecting the dataframe visually
dataframe_two.sample(10)

id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
9801,50000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,"[{""id"": 212, ""name"": ""london england""}, {""id"":...",en,Bridget Jones: The Edge of Reason,Bridget Jones is becoming uncomfortable in her...,28.803802,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...","[{""iso_3166_1"": ""FR"", ""name"": ""France""}, {""iso...",2004-11-10,40203020,108.0,"[{""iso_639_1"": ""de"", ""name"": ""Deutsch""}, {""iso...",Released,Same Bridget. Brand new diary.,Bridget Jones: The Edge of Reason,6.1,750
22832,40000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://ninja-assassin-movie.warnerbros.com/,"[{""id"": 441, ""name"": ""assassination""}, {""id"": ...",en,Ninja Assassin,"Ninja Assassin follows Raizo (Rain), one of th...",19.085403,"[{""name"": ""Studio Babelsberg"", ""id"": 264}, {""n...","[{""iso_3166_1"": ""DE"", ""name"": ""Germany""}, {""is...",2009-09-29,60462347,99.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Fear not the weapon, but the hand that wields it.",Ninja Assassin,6.2,371
2610,10284523,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",,"[{""id"": 1361, ""name"": ""salesclerk""}]",en,Shopgirl,Mirabelle is a disenchanted salesgirl and aspi...,4.820166,"[{""name"": ""Epsilon Motion Pictures"", ""id"": 117...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2005-10-21,11588205,104.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,,Shopgirl,5.7,75
109414,32000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 35, ""nam...",http://www.thisistheend.com/,"[{""id"": 570, ""name"": ""rape""}, {""id"": 1299, ""na...",en,This Is the End,While attending a party at James Franco's hous...,38.21708,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2013-06-12,126041322,107.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Nothing ruins a party like the end of the world.,This Is the End,6.3,2359
9981,45000000,"[{""id"": 10749, ""name"": ""Romance""}, {""id"": 35, ...",,"[{""id"": 494, ""name"": ""father son relationship""...",en,Kicking & Screaming,Phil Weston has been unathletic his entire lif...,8.99269,"[{""name"": ""Universal Pictures"", ""id"": 33}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2005-05-13,56070433,95.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,One man could lead this team to glory. . . Tha...,Kicking & Screaming,5.6,176
14112,7000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 80, ""name...",,"[{""id"": 255, ""name"": ""male nudity""}, {""id"": 59...",en,Auto Focus,"A successful TV star during the 1960s, former ...",7.051521,"[{""name"": ""Sony Pictures Classics"", ""id"": 58},...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2002-10-18,2062066,104.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A day without sex is a day wasted.,Auto Focus,6.1,56
8665,100000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 36, ""name...",,"[{""id"": 339, ""name"": ""submarine""}, {""id"": 2111...",en,K-19: The Widowmaker,When Russia's first nuclear submarine malfunct...,15.625949,"[{""name"": ""Paramount Pictures"", ""id"": 4}, {""na...","[{""iso_3166_1"": ""CA"", ""name"": ""Canada""}, {""iso...",2002-07-19,35168966,138.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Fate has found its hero.,K-19: The Widowmaker,6.1,264
9693,76000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 28, ""name...",http://www.universalstudiosentertainment.com/c...,"[{""id"": 318, ""name"": ""police state""}, {""id"": 4...",en,Children of Men,"In 2027, in a chaotic world in which humans ca...",35.387874,"[{""name"": ""Universal Pictures"", ""id"": 33}, {""n...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2006-09-22,69959751,109.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The future's a thing of the past.,Children of Men,7.4,2071
4912,30000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 80, ""nam...",,"[{""id"": 4688, ""name"": ""microfilm""}, {""id"": 556...",en,Confessions of a Dangerous Mind,"Television made him famous, but his biggest hi...",18.165957,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2002-12-30,33013805,113.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Some things are better left top secret.,Confessions of a Dangerous Mind,6.6,275
9823,30000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 14, ""nam...",,"[{""id"": 779, ""name"": ""martial arts""}, {""id"": 1...",en,Mortal Kombat: Annihilation,A group of heroic warriors has only six days t...,11.916389,"[{""name"": ""New Line Cinema"", ""id"": 12}, {""name...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1997-11-21,51376861,95.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"The world was created in six days, so too shal...",Mortal Kombat: Annihilation,3.8,221


In [178]:
#FILL IN - Inspecting the dataframe programmatically
dataframe_two.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 19995 to 25975
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   keywords              4803 non-null   object 
 4   original_language     4803 non-null   object 
 5   original_title        4803 non-null   object 
 6   overview              4800 non-null   object 
 7   popularity            4803 non-null   float64
 8   production_companies  4803 non-null   object 
 9   production_countries  4803 non-null   object 
 10  release_date          4802 non-null   object 
 11  revenue               4803 non-null   int64  
 12  runtime               4801 non-null   float64
 13  spoken_languages      4803 non-null   object 
 14  status                4803 non-null   object 
 15  tagline         

In [179]:
dataframe_two.isnull().sum()

budget                     0
genres                     0
homepage                3091
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

In [211]:
sum(dataframe_two.duplicated())

0

Issue and justification: 

Columns such as 'genres' and 'production_companies' contain JSON-like strings; in other words, they contain nested data. Parsing these JSON columns will deepen our understanding of the dataset. Without parsing, the data will remain in an inconsistent format that does not allow structured analysis.

The 'homepage' and 'tagline' columns have a large number of missing values (3,091 and 844, respectively). The 'overview', 'release_date', and 'runtime' columns have only a few missing values (3, 1, and 2). These data gaps might affect our analysis later in the visualization process. It would probably be best to drop the 'homepage' and 'tagline' columns since they are not relevant to the question we want to answer.
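
A minimal sketch of how the missing-value handling might be done is shown below. The three rows are invented placeholders, not real records, and dropping rows with a missing 'runtime' is only one possible choice (an assumption here, not a decision stated above).

```python
import pandas as pd

# Invented placeholder rows that mimic the missing-value pattern
movies_sample = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "homepage": ["http://example.com/a", None, None],
        "tagline": ["A tagline.", None, None],
        "runtime": [120.0, 95.0, None],
    }
)

# Share of missing values per column, useful for deciding what to drop
missing_share = movies_sample.isnull().mean()
print(missing_share)

# Drop the sparse columns that are not needed for the research question
cleaned = movies_sample.drop(columns=["homepage", "tagline"])

# Assumption: rows without a runtime are removed instead of imputed
cleaned = cleaned.dropna(subset=["runtime"])
print(cleaned)
```

Checking the share of missing values first gives a justification for each drop that can be quoted in the write-up.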


### Tidiness Issue 1:

In [180]:
#FILL IN - Inspecting the dataframe visually
dataframe_one[dataframe_one['cast'].duplicated(keep=False)]

movie_id,title,cast,crew
17644,Barney's Great Adventure,[],"[{""credit_id"": ""52fe473b9251416c750921ff"", ""de..."
447027,Running Forever,[],[]
346081,Sardaarji,[],"[{""credit_id"": ""558ab3f4925141076f0001d7"", ""de..."
126509,2016: Obama's America,[],"[{""credit_id"": ""52fe4ae1c3a368484e16bcd7"", ""de..."
371085,Sharkskin,[],[]
325140,Hum To Mohabbat Karega,[],[]
361505,Me You and Five Bucks,[],[]
114065,Down & Out With The Dolls,[],[]
137955,Crowsnest,[],[]
102840,Sex With Strangers,[],[]


In [181]:
#FILL IN - Inspecting the dataframe visually
dataframe_one[dataframe_one['crew'].duplicated(keep=False)]

movie_id,title,cast,crew
19615,Flying By,"[{""cast_id"": 1, ""character"": ""George"", ""credit...",[]
447027,Running Forever,[],[]
55831,Boynton Beach Club,"[{""cast_id"": 1, ""character"": ""Marilyn"", ""credi...",[]
371085,Sharkskin,[],[]
48382,"The Book of Mormon Movie, Volume 1: The Journey","[{""cast_id"": 1, ""character"": ""Sam"", ""credit_id...",[]
325140,Hum To Mohabbat Karega,[],[]
20653,Roadside Romeo,"[{""cast_id"": 1, ""character"": ""Romeo"", ""credit_...",[]
361505,Me You and Five Bucks,[],[]
114065,Down & Out With The Dolls,[],[]
137955,Crowsnest,[],[]


In [182]:
#FILL IN - Inspecting the dataframe programmatically
json.loads(dataframe_one['cast'].iloc[0])

[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1},
 {'cast_id': 25,
  'character': 'Dr. Grace Augustine',
  'credit_id': '52fe48009251416c750aca39',
  'gender': 1,
  'id': 10205,
  'name': 'Sigourney Weaver',
  'order': 2},
 {'cast_id': 4,
  'character': 'Col. Quaritch',
  'credit_id': '52fe48009251416c750ac9cf',
  'gender': 2,
  'id': 32747,
  'name': 'Stephen Lang',
  'order': 3},
 {'cast_id': 5,
  'character': 'Trudy Chacon',
  'credit_id': '52fe48009251416c750ac9d3',
  'gender': 1,
  'id': 17647,
  'name': 'Michelle Rodriguez',
  'order': 4},
 {'cast_id': 8,
  'character': 'Selfridge',
  'credit_id': '52fe48009251416c750ac9e1',
  'gender': 2,
  'id': 1771,
  'name': 'Giovanni Ribisi',
  'order': 5},
 {'cast_id': 7,
  'c

In [183]:
#FILL IN - Inspecting the dataframe programmatically
json.loads(dataframe_one['crew'].iloc[0])

[{'credit_id': '52fe48009251416c750aca23',
  'department': 'Editing',
  'gender': 0,
  'id': 1721,
  'job': 'Editor',
  'name': 'Stephen E. Rivkin'},
 {'credit_id': '539c47ecc3a36810e3001f87',
  'department': 'Art',
  'gender': 2,
  'id': 496,
  'job': 'Production Design',
  'name': 'Rick Carter'},
 {'credit_id': '54491c89c3a3680fb4001cf7',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Sound Designer',
  'name': 'Christopher Boyes'},
 {'credit_id': '54491cb70e0a267480001bd0',
  'department': 'Sound',
  'gender': 0,
  'id': 900,
  'job': 'Supervising Sound Editor',
  'name': 'Christopher Boyes'},
 {'credit_id': '539c4a4cc3a36810c9002101',
  'department': 'Production',
  'gender': 1,
  'id': 1262,
  'job': 'Casting',
  'name': 'Mali Finn'},
 {'credit_id': '5544ee3b925141499f0008fc',
  'department': 'Sound',
  'gender': 2,
  'id': 1729,
  'job': 'Original Music Composer',
  'name': 'James Horner'},
 {'credit_id': '52fe48009251416c750ac9c3',
  'department': 'Directing',
  

In [184]:
dataframe_one.columns

Index(['title', 'cast', 'crew'], dtype='object')

Issue and justification: 

As described in the advanced data wrangling section of the course, there are three key principles that data structures should adhere to: each variable should have its own column, each observation should have its own row, and each type of observational unit should form a separate table.

With this in mind, once we properly parse the JSON columns, we will have multiple observational units in one table. This can create issues with our analysis, and the repeated 'movie_id' values will introduce unwanted redundancy. We must separate the data into a 'movie' table and one additional table for each column that holds JSON values; in other words, additional tables named 'cast' and 'crew'.
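
The table split described above can be sketched as follows for the 'cast' column. This is an illustration on a two-row stand-in, not the final cleaning code. The names `movie_table` and `cast_table` are hypothetical, and the approach (pandas `explode` followed by `pd.json_normalize`) is one option among several.

```python
import json

import pandas as pd

# Two-row stand-in for the credits table, with movie_id as a regular column
credits_sample = pd.DataFrame(
    {
        "movie_id": [19995, 447027],
        "title": ["Avatar", "Running Forever"],
        "cast": [
            '[{"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"}, '
            '{"cast_id": 3, "character": "Neytiri", "name": "Zoe Saldana"}]',
            "[]",
        ],
    }
)

# Movie table: one row per movie
movie_table = credits_sample[["movie_id", "title"]]

# Cast table: one row per (movie, cast member)
parsed = credits_sample[["movie_id", "cast"]].copy()
parsed["cast"] = parsed["cast"].apply(json.loads)

# explode gives one row per list element; empty lists become NaN and are dropped
exploded = parsed.explode("cast").dropna(subset=["cast"])

# Expand each cast dict into columns and attach the movie_id key
cast_table = pd.concat(
    [
        exploded[["movie_id"]].reset_index(drop=True),
        pd.json_normalize(exploded["cast"].tolist()),
    ],
    axis=1,
)
print(cast_table)
```

The same pattern would apply to the 'crew' column, with 'movie_id' serving as the key that links each table back to the movie table.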

### Tidiness Issue 2: 

In [185]:
#FILL IN - Inspecting the dataframe visually
dataframe_two[dataframe_two['production_companies'].duplicated(keep=False)]

id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
285,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
49529,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
559,258000000,"[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 28, ""na...",http://www.sonypictures.com/movies/spider-man3/,"[{""id"": 851, ""name"": ""dual identity""}, {""id"": ...",en,Spider-Man 3,The seemingly invincible Spider-Man goes up ag...,115.699814,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-01,890871626,139.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,The battle within.,Spider-Man 3,5.9,3576
38757,260000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",http://disney.go.com/disneypictures/tangled/,"[{""id"": 1562, ""name"": ""hostage""}, {""id"": 2343,...",en,Tangled,When the kingdom's most wanted-and most charmi...,48.681969,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2010-11-24,591794936,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,They're taking adventure to new lengths.,Tangled,7.4,3330
767,250000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://harrypotter.warnerbros.com/harrypottera...,"[{""id"": 616, ""name"": ""witch""}, {""id"": 2343, ""n...",en,Harry Potter and the Half-Blood Prince,"As Harry begins his sixth year at Hogwarts, he...",98.885637,"[{""name"": ""Warner Bros."", ""id"": 6194}, {""name""...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2009-07-07,933959197,153.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Dark Secrets Revealed,Harry Potter and the Half-Blood Prince,7.4,5293
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14337,7000,"[{""id"": 878, ""name"": ""Science Fiction""}, {""id""...",http://www.primermovie.com,"[{""id"": 1448, ""name"": ""distrust""}, {""id"": 2101...",en,Primer,Friends/fledgling entrepreneurs invent a devic...,23.307949,"[{""name"": ""Thinkfilm"", ""id"": 446}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2004-10-08,424760,77.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,What happens if it actually works?,Primer,6.9,658
67238,0,"[{""id"": 10769, ""name"": ""Foreign""}, {""id"": 53, ...",,[],en,Cavite,"Adam, a security guard, travels from Californi...",0.022173,[],[],2005-03-12,0,80.0,[],Released,,Cavite,7.5,2
9367,220000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",,"[{""id"": 5616, ""name"": ""united states\u2013mexi...",es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]","[{""iso_3166_1"": ""MX"", ""name"": ""Mexico""}, {""iso...",1992-09-04,2040920,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238
72766,9000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,[],en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],[],2011-12-26,0,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5


In [186]:
#FILL IN - Inspecting the dataframe visually
dataframe_two[dataframe_two['spoken_languages'].duplicated(keep=False)]

id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19995,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
285,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
49026,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
49529,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
559,258000000,"[{""id"": 14, ""name"": ""Fantasy""}, {""id"": 28, ""na...",http://www.sonypictures.com/movies/spider-man3/,"[{""id"": 851, ""name"": ""dual identity""}, {""id"": ...",en,Spider-Man 3,The seemingly invincible Spider-Man goes up ag...,115.699814,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-01,890871626,139.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,The battle within.,Spider-Man 3,5.9,3576
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9367,220000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",,"[{""id"": 5616, ""name"": ""united states\u2013mexi...",es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]","[{""iso_3166_1"": ""MX"", ""name"": ""Mexico""}, {""iso...",1992-09-04,2040920,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",El Mariachi,6.6,238
72766,9000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",,[],en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],[],2011-12-26,0,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,Newlyweds,5.9,5
231617,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",http://www.hallmarkchannel.com/signedsealeddel...,"[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,"[{""name"": ""Front Street Pictures"", ""id"": 3958}...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2013-10-13,0,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,"Signed, Sealed, Delivered",7.0,6
126186,0,[],http://shanghaicalling.com/,[],en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-05-03,0,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,Shanghai Calling,5.7,7


In [187]:
#FILL IN - Inspecting the dataframe programmatically

json.loads(dataframe_two['production_companies'].iloc[0])

[{'name': 'Ingenious Film Partners', 'id': 289},
 {'name': 'Twentieth Century Fox Film Corporation', 'id': 306},
 {'name': 'Dune Entertainment', 'id': 444},
 {'name': 'Lightstorm Entertainment', 'id': 574}]

In [188]:
#FILL IN - Inspecting the dataframe programmatically
json.loads(dataframe_two['spoken_languages'].iloc[0])

[{'iso_639_1': 'en', 'name': 'English'},
 {'iso_639_1': 'es', 'name': 'Español'}]

In [189]:
dataframe_two.columns

Index(['budget', 'genres', 'homepage', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

Issue and justification: 

In the 'production_companies' column, each cell holds a JSON-encoded list of companies (name and id) instead of a single value. This nested structure violates the tidiness principle of 'one variable per column', so the column needs to be parsed before analysis. The 'spoken_languages' column has the same problem: multiple values sit within a single cell, which makes it hard to, for example, group or filter movies by language.
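
The nesting described above can also be checked programmatically by counting how many entries each cell holds. The snippet below is a minimal sketch on a toy frame, not the notebook's actual assessment: `sample`, `languages_per_movie`, and `multi_valued` are hypothetical names, and the three rows only imitate the shape of the 'spoken_languages' values shown in the outputs.

```python
import json

import pandas as pd

# Toy stand-in for dataframe_two (hypothetical rows that imitate the real column's shape).
sample = pd.DataFrame(
    {
        "spoken_languages": [
            '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Español"}]',
            '[{"iso_639_1": "en", "name": "English"}]',
            "[]",
        ]
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# Decode each JSON string and count the entries packed into the cell.
languages_per_movie = sample["spoken_languages"].apply(json.loads).apply(len)

# Cells with more than one entry are the ones that break "one value per cell".
multi_valued = languages_per_movie[languages_per_movie > 1]

print(languages_per_movie.value_counts().sort_index())
print(f"{len(multi_valued)} of {len(sample)} rows hold more than one language")
```

Running the same count on the real column would show how widespread the issue is before choosing a cleaning strategy.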

## 3. Clean data
Clean the data to solve the 4 issues corresponding to data quality and tidiness found in the assessing step. **Make sure you include justifications for your cleaning decisions.**

After cleaning each issue, use **either** the visual or the programmatic method to validate that the cleaning was successful.

At this stage, you are also expected to remove variables that are unnecessary for your analysis and combine your datasets. Depending on your datasets, you may choose to perform variable combination and elimination before or after the cleaning stage. Your dataset must have **at least** 4 variables after combining the data.

In [236]:
# FILL IN - Make copies of the datasets to ensure the raw dataframes are not impacted
df1 = dataframe_one.copy()
df2 = dataframe_two.copy()

In [237]:
df1.head()

movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [238]:
df2.head()

id,budget,genres,homepage,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19995,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
285,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
206647,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
49026,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
49529,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


### **Quality Issue 1: FILL IN**

In [243]:
# FILL IN - Apply the cleaning strategy
df1['cast']

movie_id
19995     [{"cast_id": 242, "character": "Jake Sully", "...
285       [{"cast_id": 4, "character": "Captain Jack Spa...
206647    [{"cast_id": 1, "character": "James Bond", "cr...
49026     [{"cast_id": 2, "character": "Bruce Wayne / Ba...
49529     [{"cast_id": 5, "character": "John Carter", "c...
                                ...                        
9367      [{"cast_id": 1, "character": "El Mariachi", "c...
72766     [{"cast_id": 1, "character": "Buzzy", "credit_...
231617    [{"cast_id": 8, "character": "Oliver O\u2019To...
126186    [{"cast_id": 3, "character": "Sam", "credit_id...
25975     [{"cast_id": 3, "character": "Herself", "credi...
Name: cast, Length: 4803, dtype: object

In [246]:
# Decode each JSON string, flatten to one row per cast member, then expand the dicts into columns
cast_entries = df1['cast'].apply(json.loads).explode().dropna()
info = pd.json_normalize(cast_entries.tolist())

In [233]:
normalized_df1 =  df1.columns
normalized_df1

Index(['title', 'cast', 'crew'], dtype='object')

In [155]:
# FILL IN - Validate the cleaning was successful

Justification: 

Now that we have copied the first dataframe, it is time to address the quality of the dataset. As stated earlier, the dataset contains JSON data that needs to be parsed thoroughly for further analysis. The process is explained step by step in the cell above.
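
As an illustration of the parsing idea, here is a minimal sketch on toy data. It is one possible approach, not necessarily the one used in this notebook: the frame `toy` and its values are hypothetical, and only the field names `cast_id` and `character` are taken from the 'cast' preview shown earlier.

```python
import json

import pandas as pd

# Hypothetical stand-in for df1: JSON strings in 'cast', movie_id as the index.
toy = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "cast": [
            '[{"cast_id": 1, "character": "Hero"}, {"cast_id": 2, "character": "Rival"}]',
            '[{"cast_id": 7, "character": "Narrator"}]',
            "[]",
        ],
    },
    index=pd.Index([1, 2, 3], name="movie_id"),
)

# Step 1: decode each JSON string into a Python list of dicts.
parsed = toy["cast"].apply(json.loads)

# Step 2: one row per cast member; movies with an empty list become NaN and are dropped.
exploded = parsed.explode().dropna()

# Step 3: expand each dict into columns and reattach the movie_id it came from.
cast_long = pd.json_normalize(exploded.tolist())
cast_long.insert(0, "movie_id", exploded.index.to_numpy())

print(cast_long)
```

The result has one row per cast member, so it can be validated by comparing its row count against the total number of entries in the decoded lists.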

### **Quality Issue 2: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy

In [None]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Tidiness Issue 1: FILL IN**

In [None]:
#FILL IN - Apply the cleaning strategy
df1 = df1.drop(columns='crew')  # no inplace=True here: it returns None, which would overwrite df1

In [None]:
#FILL IN - Validate the cleaning was successful
df1.head()

Justification: Instead of normalizing the 'crew' column, I decided to drop it, because it holds multiple observational units that are not relevant to my problem. Dropping it enhances the tidiness of the dataset.
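
A common pitfall when dropping a column is combining assignment with `inplace=True`. The sketch below uses a hypothetical toy frame to show why: with `inplace=True`, `drop` modifies the frame and returns `None`, so assigning the result loses the data.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as df1.
toy = pd.DataFrame({"title": ["Movie A"], "cast": ["[]"], "crew": ["[]"]})

# With inplace=True, drop() returns None, so this assignment stores None.
result = toy.copy().drop(columns="crew", inplace=True)
print(result)  # None

# Without inplace, drop() returns a new frame that is safe to assign.
trimmed = toy.drop(columns="crew")
print(list(trimmed.columns))  # ['title', 'cast']
```

Either `df = df.drop(columns=...)` or `df.drop(columns=..., inplace=True)` on its own works; mixing the two does not.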


### **Tidiness Issue 2: FILL IN**

In [1]:
#FILL IN - Apply the cleaning strategy

In [2]:
#FILL IN - Validate the cleaning was successful

Justification: *FILL IN*

### **Remove unnecessary variables and combine datasets**

Depending on the datasets, you can also perform the combination before the cleaning steps.

In [None]:
#FILL IN - Remove unnecessary variables and combine datasets
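
One possible way to approach this step is sketched below on toy data. It assumes, as in the previews above, that the first frame is indexed by `movie_id` and the second by `id`, and that both refer to the same movie identifier. The frames `credits` and `movies`, their values, and the choice of `budget`, `revenue`, and `runtime` as the variables to keep are all hypothetical; the right selection depends on the research question.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (indexed by movie_id) and df2 (indexed by id).
credits = pd.DataFrame(
    {"title": ["Movie A", "Movie B"], "cast": ["[]", "[]"]},
    index=pd.Index([1, 2], name="movie_id"),
)
movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B"],
        "budget": [100, 200],
        "revenue": [300, 150],
        "runtime": [90.0, 120.0],
    },
    index=pd.Index([1, 2], name="id"),
)

# Keep only the variables of interest; 'title' exists in both frames,
# so it is taken from one side only to avoid title_x / title_y columns.
keep = ["budget", "revenue", "runtime"]

# Join on the shared movie identifier stored in both indexes.
# validate="one_to_one" raises if either index contains duplicate ids.
combined = pd.merge(
    credits,
    movies[keep],
    left_index=True,
    right_index=True,
    how="inner",
    validate="one_to_one",
)
combined.index.name = "movie_id"

print(combined.shape)
print(list(combined.columns))
```

The combined frame here has five variables, which satisfies the requirement of at least four after combining.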

## 4. Update your data store
Update your local database/data store with the cleaned data, following best practices for storing your cleaned data:

- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensure both the raw and cleaned data is saved to your database/data store
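
The practices listed above can be sketched as follows. This is a minimal illustration, not a required layout: the folder structure, the file names, and the toy frames `raw` and `cleaned` are hypothetical, and a temporary directory is used only so the example has no side effects (in the project it would be replaced by the chosen data store location).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical raw and cleaned versions of the same dataset.
raw = pd.DataFrame(
    {"title": ["Movie A"], "crew": ["[]"]},
    index=pd.Index([1], name="movie_id"),
)
cleaned = raw.drop(columns="crew")

# Temporary folder for the demo; swap in the project's data store path.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "raw").mkdir(parents=True, exist_ok=True)
(data_dir / "cleaned").mkdir(parents=True, exist_ok=True)

# Informative names that state the dataset and its processing stage.
raw_path = data_dir / "raw" / "movie_credits_raw.csv"
cleaned_path = data_dir / "cleaned" / "movie_credits_cleaned.csv"

# Save both versions so the raw data is never overwritten by the cleaned data.
raw.to_csv(raw_path)
cleaned.to_csv(cleaned_path)

# Reload the cleaned file to confirm it was written as expected.
reloaded = pd.read_csv(cleaned_path, index_col="movie_id")
print(list(reloaded.columns))
```

Keeping the raw and cleaned files in separate folders makes it harder to overwrite the original data by accident.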

In [None]:
#FILL IN - saving data

## 5. Answer the research question

### **5.1:** Define and answer the research question 
Going back to the problem statement in step 1, use the cleaned data to answer the question you raised. Produce **at least** two visualizations using the cleaned data and explain how they help you answer the question.

*Research question:* FILL IN from answer to Step 1

In [None]:
#Visual 1 - FILL IN

*Answer to research question:* FILL IN

In [None]:
#Visual 2 - FILL IN

*Answer to research question:* FILL IN

### **5.2:** Reflection
In 2-4 sentences, if you had more time to complete the project, what actions would you take? For example, which data quality and structural issues would you look into further, and what research questions would you further explore?

*Answer:* FILL IN