In [1]:
import pandas as pd

data = pd.read_json('data.json', orient='records')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1304 entries, 0 to 1303
Data columns (total 63 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  1303 non-null   object 
 1   image                 1137 non-null   object 
 2   Hangul                1 non-null      object 
 3   Hanja                 1 non-null      object 
 4   Revised Romanization  1 non-null      object 
 5   McCune–Reischauer     1 non-null      object 
 6   Directed by           1286 non-null   object 
 7   Produced by           1261 non-null   object 
 8   Screenplay by         539 non-null    object 
 9   Story by              163 non-null    object 
 10  Starring              1147 non-null   object 
 11  Music by              1075 non-null   object 
 12  Cinematography        1111 non-null   object 
 13  Edited by             1139 non-null   object 
 14  Production            876 non-null    object 
 15  Distributed by       

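Looking at the output of the `info` call, we can see that many of the columns hold only a handful of values (several have a single non-null entry). We can safely remove these columns from our final data model; the call below keeps only the columns with at least five non-null values.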

In [2]:
data = data.dropna(thresh=5, axis=1)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1304 entries, 0 to 1303
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            1303 non-null   object
 1   image           1137 non-null   object
 2   Directed by     1286 non-null   object
 3   Produced by     1261 non-null   object
 4   Screenplay by   539 non-null    object
 5   Story by        163 non-null    object
 6   Starring        1147 non-null   object
 7   Music by        1075 non-null   object
 8   Cinematography  1111 non-null   object
 9   Edited by       1139 non-null   object
 10  Production      876 non-null    object
 11  Distributed by  1214 non-null   object
 12  Release date    1296 non-null   object
 13  Running time    1283 non-null   object
 14  Country         1271 non-null   object
 15  Language        1267 non-null   object
 16  Budget          797 non-null    object
 17  Box office      877 non-null    object
 18  Based on
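To make the meaning of `thresh` concrete, here is a minimal sketch on a toy frame. The column names are borrowed from the output above, but the values are made up for illustration. With `axis=1`, `thresh=5` keeps a column only if it has at least five non-null values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 'name' has 6 non-null cells,
# 'Budget' has exactly 5, and 'Hangul' has only 1.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "Hangul": [np.nan, np.nan, "x", np.nan, np.nan, np.nan],
    "Budget": ["1", "2", np.nan, "4", "5", "6"],
})

# Keep only the columns with at least 5 non-null values
pruned = toy.dropna(thresh=5, axis=1)

kept = list(pruned.columns)
dropped = sorted(set(toy.columns) - set(kept))
print("kept:", kept)
print("dropped:", dropped)
```

Note that a column with exactly five non-null values survives, so the threshold is inclusive.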

Next, we check why important columns such as `name` and `Release date` have missing values.

In [3]:
data[pd.isna(data['name'])]

Unnamed: 0,name,image,Directed by,Produced by,Screenplay by,Story by,Starring,Music by,Cinematography,Edited by,...,Language,Budget,Box office,Based on,Written by,Animation by,Layouts by,Color process,Narrated by,Backgrounds by
1170,,{'src': '//upload.wikimedia.org/wikipedia/en/t...,,,,,,,,,...,English,,,,,,,,,


Among the columns shown, only the image and the language are present here.

In [4]:
data[pd.isna(data['name'])].image.values[0]['src']

'//upload.wikimedia.org/wikipedia/en/thumb/a/ae/The_Lost_Thing_cover.jpg/220px-The_Lost_Thing_cover.jpg'

Doing some research, we find that the link for this film actually points to a [book page](https://en.wikipedia.org/wiki/The_Lost_Thing) that does not have the same information in the side box as the rest. We can therefore ignore this entry.

In [5]:
data[pd.isna(data['Release date'])]

Unnamed: 0,name,image,Directed by,Produced by,Screenplay by,Story by,Starring,Music by,Cinematography,Edited by,...,Language,Budget,Box office,Based on,Written by,Animation by,Layouts by,Color process,Narrated by,Backgrounds by
232,Episode chronology,,Robert Enrico,,,,,,,,...,,,,,"Robert Enrico, based on a short story by",,,,,
609,Czechoslovakia 1968,,Denis Sanders,Denis Sanders,,,,Charles Bernstein,,Marvin Wallowitz,...,English,,,,,,,,,
761,Release,,Ferenc Rofusz,,,,,,,,...,,,,,Ferenc Rofusz,,,,,
902,External links,{'src': '//upload.wikimedia.org/wikipedia/en/t...,,,,,,,,,...,,,,,,,,,,
972,External links,{'src': '//upload.wikimedia.org/wikipedia/en/t...,Jon Blair,,,,Glenn Close,,Barry Ackroyd,,...,,,,,Jon Blair,,,,Kenneth Branagh,
1070,Chernobyl Heart,,Maryann DeLeo,Maryann DeLeo,,,,,,John Custodio,...,,,,,,,,,,
1170,,{'src': '//upload.wikimedia.org/wikipedia/en/t...,,,,,,,,,...,English,,,,,,,,,
1219,Curfew,{'src': '//upload.wikimedia.org/wikipedia/en/t...,Shawn Christensen,Damon Russell,,,Fátima Ptacek,Darren Morze,Daniel Katz,Shawn Christensen,...,English,,,,Shawn Christensen,,,,,


These entries also have a lot of missing data on their respective pages, and some of those pages use a different HTML structure. So far, then, we can assume that the parsing is working properly.
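Rows like these can be flagged so they are easy to review or exclude later. The sketch below runs on a toy frame and assumes that names such as "External links", "Release" and "Episode chronology" are page section headings captured in place of a title. That reading is an inference from the output above, not something the data confirms, and `SUSPECT_NAMES` is a hypothetical helper set.

```python
import pandas as pd

# Hypothetical set of values that look like section headings rather than titles
SUSPECT_NAMES = {"External links", "Release", "Episode chronology"}

# Toy frame modeled on the rows shown above
toy = pd.DataFrame({
    "name": ["Curfew", "External links", None, "Release"],
    "Release date": [None, None, None, None],
})

# A row is suspect if its name is missing or matches a known heading
suspect = toy["name"].isna() | toy["name"].isin(SUSPECT_NAMES)
print(toy[suspect])
```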

Now let's try to structure the data into different tables.

In [6]:
from pandas.api.types import infer_dtype

dtypes = [infer_dtype(data[col]) for col in data.columns]

data.columns[list(map(lambda x: x == 'string', dtypes))]

Index(['name', 'Layouts by', 'Color process', 'Backgrounds by'], dtype='object')

Only four columns are inferred as pure strings, so most of the columns seem to hold mixed types of data.
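As a small illustration of how `infer_dtype` arrives at these labels, the sketch below uses two made-up Series. Missing values are skipped by default, so a column of strings and `NaN` is still reported as `'string'`, while a column that mixes strings with lists is reported as `'mixed'`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Strings plus a missing value: NaN is skipped during inference
only_strings = pd.Series(["Parasite", np.nan, "Fences"])

# Strings, a list and a missing value in the same column
strings_and_lists = pd.Series(
    ["Victor Fleming", ["Guy Brenton", "Lindsay Anderson"], np.nan]
)

label_strings = infer_dtype(only_strings)
label_mixed = infer_dtype(strings_and_lists)
print(label_strings, label_mixed)
```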

In [7]:
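# Select the columns whose inferred dtype is 'mixed'
mixed_data_columns = data.columns[[dtype == 'mixed' for dtype in dtypes]]

# For each mixed column, summarize which Python types its cells hold
for col in mixed_data_columns:
    print(data[col].map(type).describe())
    print(data[col].map(type).unique())
    print()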

count               1304
unique                 2
top       <class 'dict'>
freq                1137
Name: image, dtype: object
[<class 'dict'> <class 'float'>]

count              1304
unique                3
top       <class 'str'>
freq               1253
Name: Directed by, dtype: object
[<class 'str'> <class 'list'> <class 'float'>]

count              1304
unique                3
top       <class 'str'>
freq                982
Name: Produced by, dtype: object
[<class 'list'> <class 'str'> <class 'float'>]

count                1304
unique                  3
top       <class 'float'>
freq                  765
Name: Screenplay by, dtype: object
[<class 'list'> <class 'str'> <class 'float'>]

count                1304
unique                  3
top       <class 'float'>
freq                 1141
Name: Story by, dtype: object
[<class 'str'> <class 'float'> <class 'list'>]

count              1304
unique                3
top       <class 'str'>
freq                608
Name: Starring, dtyp

Analyzing these results, we realize that the `float` entries are `NaN` values. Apart from `image`, which holds dicts, the other mixed columns contain lists alongside plain strings.
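One way to make such columns uniform is to turn every cell into a list. The helper below is a minimal sketch, not part of the original notebook. `to_list` is a hypothetical name, and it assumes cells are either missing (`None` or `NaN`), a single string, or a list.

```python
import math


def to_list(value):
    """Normalize one cell: [] for missing, [s] for a single value, a copy for a list."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []  # NaN marks a missing value
    if isinstance(value, list):
        return list(value)  # copy so the original cell is not shared
    return [value]


print(to_list(float("nan")))
print(to_list("Bong Joon-ho"))
print(to_list(["Michael Powell", "Emeric Pressburger"]))
```

On the real data this could be applied per column, for example with `data['Directed by'].map(to_list)`.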

While a movie can have multiple languages, it is suspicious for it to have multiple budgets or to be based on multiple sources. Inspecting further:

In [27]:
for col in data.columns:
    # Keep only the cells that hold a list, then preview the first few
    is_list = data[col].map(lambda value: isinstance(value, list))
    print(data.loc[is_list, col].head())
    print()

Series([], Name: name, dtype: object)

Series([], Name: image, dtype: object)

27           [ (uncredited), Victor Fleming, King Vidor]
94                       [Guy Brenton, Lindsay Anderson]
137                      [Harve Foster, Wilfred Jackson]
152    [ (supervising), William Cottrell, David Hand,...
177                 [Michael Powell, Emeric Pressburger]
Name: Directed by, dtype: object

0    [Moon Yang-kwon, Bong Yok-cho, Jang Young-hwan...
1         [Denzel Washington, Todd Black, Scott Rudin]
2                         [Kristóf Deák, Anna Udvardy]
3    [Shawn Levy, Dan Levine, Aaron Ryder, David Li...
4    [Ezra Edelman, Tamara Rosenberg, Nina Krstic, ...
Name: Produced by, dtype: object

0                           [Bong Joon-ho, Han Jin-won]
8                     [Robert Schenkkan, Andrew Knight]
26                    [Samuel Hoffenstein, Eric Taylor]
27    [Noel Langley, Florence Ryerson, Edgar Allan W...
81                        [Tess Slesinger, Frank Davis]
Name: Screenp

We have finally stumbled on some inconsistencies in our parsing. Looking closely at the `Based on` column, we find that its values might not be genuine lists after all.

In [29]:
# Preview the 'Based on' cells that were parsed as lists
data.loc[data['Based on'].map(lambda value: isinstance(value, list)), 'Based on'].head()

192    [, a novel, by ,  , Vladimir Jabotinsky, Holy ...
234       [by Alan Jay Lerner, by , George Bernard Shaw]
278                                  [Louisa May Alcott]
599             [by , by , Alan Jay Lerner, T. H. White]
631    [1954 Biography, by , 1961 Autobiography, by ,...
Name: Based on, dtype: object

There are similar inconsistencies that are now apparent. However, we now have a thorough understanding of the structure of the data we have and can go back to our scraping application.
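To estimate how widespread these parsing artifacts are before returning to the scraper, list cells can be scanned for leftover fragments. The sketch below is a heuristic under stated assumptions: it treats blank items, a bare "by", and fully parenthesized items such as " (uncredited)" as fragments. `is_fragment` and `fragment_count` are hypothetical helpers, and the sample lists are modeled on the output above.

```python
def is_fragment(item):
    """Heuristic: True for items that look like leftover markup text."""
    text = item.strip()
    if text == "" or text.lower() == "by":
        return True
    # Fully parenthesized notes such as "(uncredited)" or "(supervising)"
    return text.startswith("(") and text.endswith(")")


def fragment_count(cell):
    """Count fragment-like string items in a list cell; non-lists count as 0."""
    if not isinstance(cell, list):
        return 0
    return sum(1 for item in cell if isinstance(item, str) and is_fragment(item))


samples = [
    [" (uncredited)", "Victor Fleming", "King Vidor"],
    ["by Alan Jay Lerner", "by ", "George Bernard Shaw"],
    ["Louisa May Alcott"],
    "Robert Enrico",
]
counts = [fragment_count(cell) for cell in samples]
print(counts)
```

Mapping `fragment_count` over each list-valued column and summing the result would show which fields the scraper splits most often.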