### 2.1 Parse IMDB Data

The IMDB data (25K data points) was downloaded from the IMDB server.
We clean each column and extract the relevant information as follows:

- For the response "genres": split the multiple genres of one movie and store them in a list.
- For the variables "runtimes", "language codes", "languages", "country codes", "countries": extract the multiple entries for one movie and store the result in a list.
- For the variables "director", "writer", "cast", "distributors", "producer", "production companies", "cinematographer", "animation department", "original music", "editorial department": extract the multiple person IDs and company IDs for one movie and store the result in a list.
- For the variable "mpaa": extract the MPAA rating for each movie and store it under "mpaa"; also keep the explanation of why the rating was given, and store that string under "mpaa_reason".
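
The cell-level rules above can be sketched on a few sample strings. The raw cell formats used here are assumptions inferred from the parsing code in this section (Python 2 style list strings such as `[u'...', u'...']`, and person entries that embed a numeric ID). The helper names `split_list_cell`, `split_mpaa`, and `extract_ids`, as well as the sample cells, are hypothetical and only for illustration.

```python
import re


def split_list_cell(cell):
    """Turn a string such as "[u'Action', u'Sci-Fi']" into a Python list."""
    # drop the leading "[u'" and the trailing "']", then split on "', u'"
    return cell[3:-2].split("', u'")


def split_mpaa(cell):
    """Return (rating, reason) from 'Rated <rating> for <reason>'."""
    match = re.search(r"Rated (.*?) for (.*)", cell)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


def extract_ids(cell):
    """Collect every run of digits in the cell (assumed to be the IDs)."""
    return re.findall(r"\d+", cell)


# Hypothetical raw cells; the real file may be formatted differently.
genres = split_list_cell("[u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']")
rating, reason = split_mpaa("Rated PG for sci-fi violence and brief mild language")
ids = extract_ids("[<Person id:0004056[http] name:_Doe, Jane_>, "
                  "<Person id:0881279[http] name:_Roe, Richard_>]")

print(genres)
print(rating, "|", reason)
print(ids)
```

Keeping each rule in its own small function makes it easy to check a rule against a handful of raw cells before running it over all rows.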

In [7]:
##### Function to parse the data downloaded from IMDB.
##### Reads the raw data from data_name and writes the parsed data frame to new_name.

import re

import pandas as pd


def parse_imdb(data_name="imdb_data.txt", new_name="imdb_data_parse.txt"):

    #### load the IMDB data to be parsed
    df = pd.read_csv(data_name)

    ## work on a copy, and duplicate column "mpaa" to hold the reason for the MPAA rating
    df2 = df.copy()
    df2["mpaa_reason"] = df2["mpaa"]

    #### variable groups
    #group0 = ["genres", "imdb_id"]
    #group1 = ["title", "year", "rating", "votes", "top 250 rank", "kind"]
    #group3 = ["mpaa", "mpaa_reason"]
    list_vars = ["genres", "runtimes", "language codes", "languages", "country codes", "countries"]
    id_vars = ["director", "writer", "cast", "distributors", "producer", "production companies",
               "cinematographer", "animation department", "original music", "editorial department"]

    ## parsed cells hold Python lists, so store these columns as generic objects
    for val in list_vars + id_vars + ["mpaa", "mpaa_reason"]:
        df2[val] = df2[val].astype(object)

    #### for each row of the data (read_csv assigns the default index 0..n-1)
    for i in range(len(df2)):

        ### -----------------------------------------------------------------------
        ## for variables with multiple items, return a list (later turned into dummy coding)
        for val in list_vars:
            st = df2.at[i, val]
            if isinstance(st, str):  # skip missing entries, which are float NaN
                # drop the leading "[u'" and trailing "']", then split on "', u'"
                df2.at[i, val] = st[3:-2].split("', u'")

        ### -----------------------------------------------------------------------
        ## extract the movie's MPAA rating as a string (later turned into a factor)
        st = df2.at[i, "mpaa"]
        if isinstance(st, str):
            match = re.search(r'Rated (.*?) for', st)  # the word between "Rated" and "for"
            if match:  # leave the entry unchanged if the pattern is absent
                df2.at[i, "mpaa"] = match.group(1)

        ### -----------------------------------------------------------------------
        ## extract the reason for the MPAA rating as a string (for text analysis later)
        st = df2.at[i, "mpaa_reason"]
        if isinstance(st, str):
            parts = st.split("for ", 1)
            if len(parts) == 2:  # leave the entry unchanged if there is no "for "
                df2.at[i, "mpaa_reason"] = parts[1]

        ### -----------------------------------------------------------------------
        ## for variables in the dictionary format, return a list of:
        ##     - person id
        ##     - company id (for "production companies" and "distributors")
        ## note: r'\d+' keeps every run of digits in the cell, which is presumably
        ##       where short entries such as '20' under "distributors" come from
        for val in id_vars:
            st = df2.at[i, val]
            if isinstance(st, str):
                df2.at[i, val] = re.findall(r'\d+', st)

        ### -----------------------------------------------------------------------
        ## variables that may not be useful:
            # ["cover url", "full-size cover url", "canonical title", "canonical title.1",
            # "long imdb title", "long imdb canonical title", "smart canonical title",
            # "smart long imdb canonical title"]

    df2.to_csv(new_name)
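
As a possible alternative to visiting every row, the same four rules can be applied one column at a time with the pandas string accessor, which tends to be faster on a table of this size. This is only a sketch under assumptions: the two-row `raw` frame and its cell formats are made up for illustration, and missing values are simply left as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical raw cells; the second row stands for a movie with missing values.
raw = pd.DataFrame({
    "genres": ["[u'Action', u'Sci-Fi']", np.nan],
    "mpaa": ["Rated PG for sci-fi violence and brief mild language", np.nan],
    "director": ["[<Person id:0000184[http] name:_Doe, Jane_>]", np.nan],
})

parsed = pd.DataFrame({
    # strip "[u'" and "']", then split on the separator between items
    "genres": raw["genres"].str[3:-2].str.split("', u'"),
    # text between "Rated" and "for"; NaN where the pattern is absent
    "mpaa": raw["mpaa"].str.extract(r"Rated (.*?) for", expand=False),
    # everything after the first "for "
    "mpaa_reason": raw["mpaa"].str.split("for ", n=1).str[1],
    # every run of digits in the cell
    "director": raw["director"].str.findall(r"\d+"),
})

print(parsed)
```

The `.str` methods skip missing entries on their own, so no explicit type check is needed in this version.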


For demonstration purposes, we use the first 100 rows of the data set here.

In [8]:
parse_imdb(data_name="imdb_top100_data.txt", new_name="imdb_top100_data_parse.txt")

In [9]:
imdb_top100_data_parse = pd.read_csv("imdb_top100_data_parse.txt")
imdb_top100_data_parse.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,title,genres,director,distributors,year,rating,votes,...,canonical title,editorial department,canonical title.1,long imdb title,long imdb canonical title,smart canonical title,smart long imdb canonical title,full-size cover url,imdb_ids,mpaa_reason
0,0,0,0,Four Rooms,['Comedy'],"['0025978', '0734319', '0001675', '0000233']","['0051881', '0014703', '0080682', '0022594', '...",1995.0,6.7,81543.0,...,Four Rooms,"['0020317', '0003692', '0042901', '0120656', '...",Four Rooms,Four Rooms (1995),Four Rooms (1995),Four Rooms,Four Rooms (1995),https://images-na.ssl-images-amazon.com/images...,113101,"pervasive strong language, sexuality and some ..."
1,1,1,1,"Sonntag, im August",['Short'],['1262754'],,2004.0,6.5,11.0,...,"Sonntag, im August",,"Sonntag, im August","Sonntag, im August (2004)","Sonntag, im August (2004)","Sonntag, im August","Sonntag, im August (2004)",https://images-na.ssl-images-amazon.com/images...,425473,
2,2,2,2,Star Wars,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['0000184'],"['0000756', '0074275', '20', '0105985', '20', ...",1977.0,8.7,969951.0,...,Star Wars,"['0091250', '0300879', '0366165', '0385492', '...",Star Wars,Star Wars (1977),Star Wars (1977),Star Wars,Star Wars (1977),https://images-na.ssl-images-amazon.com/images...,76759,sci-fi violence and brief mild language
3,3,3,3,Finding Nemo,"['Animation', 'Adventure', 'Comedy', 'Family']","['0004056', '0881279']","['0358618', '0064622', '0000779', '0110125', '...",2003.0,8.1,743705.0,...,Finding Nemo,"['0023224', '1404655', '1733701', '1404694', '...",Finding Nemo,Finding Nemo (2003),Finding Nemo (2003),Finding Nemo,Finding Nemo (2003),https://images-na.ssl-images-amazon.com/images...,266543,
4,4,4,4,The Dark,"['Horror', 'Mystery', 'Thriller']",['0269502'],"['0002257', '0059064', '0022594', '0049085', '...",2005.0,5.4,9415.0,...,"Dark, The","['1262315', '3024002', '1930271', '0996933', '...","Dark, The",The Dark (2005),"Dark, The (2005)","Dark, The","Dark, The (2005)",https://images-na.ssl-images-amazon.com/images...,411267,some violent/disturbing images and language
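
Because the parsed file is written as CSV, the list-valued columns come back as strings such as `"['Comedy']"` rather than Python lists, as the output above shows. One way to recover the lists is `ast.literal_eval`, sketched here on a three-row stand-in for the parsed file. The helper `to_list` and the inline CSV text are illustrative assumptions, not part of the original pipeline.

```python
import ast
import io

import pandas as pd


def to_list(cell):
    """Convert "['a', 'b']" back into a list; leave missing values untouched."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell


# Small stand-in for imdb_top100_data_parse.txt (three columns only).
csv_text = '''title,genres,director
Four Rooms,['Comedy'],"['0025978', '0734319', '0001675', '0000233']"
"Sonntag, im August",['Short'],['1262754']
Star Wars,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",['0000184']
'''

reloaded = pd.read_csv(io.StringIO(csv_text))
for col in ["genres", "director"]:
    reloaded[col] = reloaded[col].apply(to_list)

print(reloaded.loc[2, "genres"])
print(type(reloaded.loc[0, "director"]))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.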
