https://www.dataquest.io/m/215/guided-project%3A-preparing-data-for-sqlite

This project uses a cleaned-up and transformed version of the [Academy Award nominations and winners dataset](https://www.aggdata.com/awards/oscar).

# 1. Introduction to the data

The dataset contains information on Academy Award nominations up to the year 2010.

This project extracts only the rows for the actor and actress categories between the years 2001 and 2010, and then adds them to a SQLite table.

## 1.1. Quick glance at the data

This is a preview of the data.

In [1]:
import pandas as pd
from IPython.display import display
from pprint import pprint

# read in dataset
df = pd.read_csv("academy_awards.csv", encoding="ISO-8859-1")

# show top five rows of data
display(df.head())

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010 (83rd),Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010 (83rd),Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


In [2]:
# show column names
pprint(df.columns.tolist())

['Year',
 'Category',
 'Nominee',
 'Additional Info',
 'Won?',
 'Unnamed: 5',
 'Unnamed: 6',
 'Unnamed: 7',
 'Unnamed: 8',
 'Unnamed: 9',
 'Unnamed: 10']


## 1.2. Explore data further

### 1.2.1. "Unnamed: " columns

From the preview above, the "Unnamed: " columns do not seem useful.

To check whether this is true, the counts of each unique value in these columns are displayed below.

In [3]:
# check values of "Unnamed: 5" column
pprint(df["Unnamed: 5"].value_counts())

*                                                                                                               7
 discoverer of stars                                                                                            1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 resilience                                                                                                     1
Name: Unnamed: 5, dtype: int64


In [4]:
# check values of "Unnamed: 6" column
pprint(df["Unnamed: 6"].value_counts())

*                                                                   9
 sympathetic                                                        1
 flexibility and water resistance                                   1
 direct radiator bass style cinema loudspeaker systems. [Sound]"    1
Name: Unnamed: 6, dtype: int64


In [5]:
# check values of "Unnamed: 7" column
pprint(df["Unnamed: 7"].value_counts())

*                                                     1
 kindly                                               1
 while requiring no dangerous solvents. [Systems]"    1
Name: Unnamed: 7, dtype: int64


In [6]:
# check values of "Unnamed: 8" column
pprint(df["Unnamed: 8"].value_counts())

*                                                 1
 understanding comedy genius - Mack Sennett.""    1
Name: Unnamed: 8, dtype: int64


In [7]:
# check values of "Unnamed: 9" column
pprint(df["Unnamed: 9"].value_counts())

*    1
Name: Unnamed: 9, dtype: int64


In [8]:
# check values of "Unnamed: 10" column
pprint(df["Unnamed: 10"].value_counts())

*    1
Name: Unnamed: 10, dtype: int64


These columns hold only a few stray text fragments and `*` markers, so the "Unnamed: " columns can be discarded.
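The six cells above could also be condensed into a single check. The following is a minimal sketch, assuming a toy DataFrame (`toy`, with hypothetical values) in place of the real `academy_awards.csv`. It counts the non-null values in every "Unnamed: " column and then drops those columns.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Year": ["2010 (83rd)", "2010 (83rd)", "2009 (82nd)"],
    "Won?": ["NO", "YES", "NO"],
    "Unnamed: 5": [None, "*", None],
    "Unnamed: 6": [None, None, None],
})

# Select every column whose name starts with "Unnamed: "
unnamed_cols = [c for c in toy.columns if c.startswith("Unnamed: ")]

# Count the non-null values in each of those columns
non_null_counts = toy[unnamed_cols].notna().sum()
print(non_null_counts)

# Drop the columns once they are confirmed to hold nothing useful
toy_clean = toy.drop(columns=unnamed_cols)
print(toy_clean.columns.tolist())
```

Selecting the columns by prefix avoids listing each column name by hand.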

# 2. Filtering the data


Extract the rows that (1) fall in the years 2001 to 2010 and (2) belong to the actor and actress categories.


## 2.1. Modify "Year" column

The "Year" column will be converted to four-digit integer values representing years.

In [9]:
# Simplify "Year" column (e.g. "2010 (83rd)" --> "2010")
df["Year"] = df["Year"].str.slice(start=0, stop=4)

# Convert format (string --> integer)
df["Year"] = df["Year"].astype("int64")

# Show first 5 rows of "Year" column
df["Year"].head()

0    2010
1    2010
2    2010
3    2010
4    2010
Name: Year, dtype: int64

## 2.2. Filter dataset by year and category

[`pandas.Series.isin`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html) is used for filtering by category.

In [10]:
# 1. Extract rows for years 2001 and later
later_than_2000 = df[df["Year"] > 2000]

# 2. Extract nominations for actors and actresses
award_categories = ["Actor -- Leading Role", 
                    "Actor -- Supporting Role", 
                    "Actress -- Leading Role", 
                    "Actress -- Supporting Role"]

nominations = later_than_2000[later_than_2000["Category"].isin(award_categories)]

# 3. Cleaning up the Won? and Unnamed columns

Values in the "Won?" column will be converted ("YES" --> 1; "NO" --> 0) so that they are compatible with SQLite. I will use [`pandas.Series.map`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) for the conversion.

Then, the column will be renamed as "Won".

In [11]:
# Turn off "SettingWithCopyWarning" warning (https://stackoverflow.com/a/20627316)
# (if not turned off, the next two cells will create a warning message.)
pd.options.mode.chained_assignment = None  # default='warn'
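
An alternative to silencing the warning globally is to take an explicit copy of the filtered rows, so that later assignments act on an independent DataFrame. This is a minimal sketch on toy data (the `toy` frame, its values, and the name `nominations_copy` are hypothetical, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the dataset after the "Year" conversion (hypothetical values)
toy = pd.DataFrame({
    "Year": [2010, 1999, 2005],
    "Category": ["Actor -- Leading Role", "Actor -- Leading Role", "Writing"],
    "Won?": ["YES", "NO", "NO"],
})

award_categories = ["Actor -- Leading Role", "Actress -- Leading Role"]

# Combine both conditions in one boolean mask, then copy the result
mask = (toy["Year"] > 2000) & toy["Category"].isin(award_categories)
nominations_copy = toy[mask].copy()

# Assigning to the copy leaves `toy` untouched and raises no SettingWithCopyWarning
nominations_copy["Won"] = nominations_copy["Won?"].map({"YES": 1, "NO": 0})
print(nominations_copy)
```

Note that mapping with a plain dictionary turns any value other than "YES" or "NO" into `NaN`, whereas the `convert` function in this notebook leaves such values untouched.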

In [12]:
# convert values
def convert(string):
    """
    Convert strings
    "YES" --> 1
    "NO" --> 0
    Any other values are untouched.
    """
    
    conv_dic = {"YES": 1, 
               "NO": 0}
    
    output = conv_dic[string] if string in conv_dic \
                              else string
    
    return output

nominations["Won?"] = nominations["Won?"].map(convert)

In [13]:
# Rename columns
nominations["Won"] = nominations["Won?"]
final_nominations = nominations.drop(labels=['Won?', 
                          'Unnamed: 5',
                          'Unnamed: 6',
                          'Unnamed: 7',
                          'Unnamed: 8',
                          'Unnamed: 9',
                          'Unnamed: 10'],
                 axis=1)

# show first five rows of filtered dataset
final_nominations.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},0
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},0
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},0
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},1
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},0


# 4. Cleaning up the Additional Info column

Values in the "Additional Info" column (e.g. "Biutiful {'Uxbal'}") will be split into two columns so that one column will contain the movie title ("Biutiful") and the other the character name ("Uxbal").

First, I will check whether all rows can be split with " {'" and "'}".

In [14]:
# do all rows contain " {'"?
print(final_nominations["Additional Info"].str.contains(" {'").unique())

# do all rows end with "'}"?
print(final_nominations["Additional Info"].str.endswith("'}").unique())

[ True]
[ True]


Both checks return only `True`, so it is safe to do so.

Now, each row will be split and saved into "Movie" and "Character" columns, which will replace the current "Additional Info" column.

In [15]:
# strip "'}"
additional_info_one = final_nominations["Additional Info"].str.rstrip("'}")

# split by " {'"
additional_info_two = additional_info_one.str.split(" {'")

# add split values into Movie and Character columns
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]

final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters

# remove Additional Info column
final_nominations.drop(labels="Additional Info", axis=1, inplace=True)

# show first five rows of dataset
final_nominations.head()

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,Aron Ralston
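
Note that `str.rstrip("'}")` removes any trailing run of the characters `'` and `}`, not just the literal suffix. A regular expression can serve as a stricter alternative. The sketch below assumes a hypothetical helper, `split_additional_info`, that is not part of the original notebook. It returns `None` when a value does not fit the expected shape.

```python
import re

# Title, then a space, then the character name wrapped in {' ... '}
PATTERN = re.compile(r"(?P<movie>.+) \{'(?P<character>.+)'\}")

def split_additional_info(value):
    """Return (movie, character), or None if the value does not fit the pattern."""
    match = PATTERN.fullmatch(value)
    if match is None:
        return None
    return match.group("movie"), match.group("character")

print(split_additional_info("Biutiful {'Uxbal'}"))
print(split_additional_info("The King's Speech {'King George VI'}"))
print(split_additional_info("Matthew Libatique"))
```

Returning `None` for non-matching values makes it easy to spot rows that need separate handling.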


# 5. Exporting to SQLite

In [16]:
import sqlite3

# connect to database
conn = sqlite3.connect("nominations.db")

# create table
final_nominations.to_sql(name="nominations", con=conn, index=False, if_exists="replace")

# 6. Verifying in SQL

This section verifies that the pandas DataFrame was transferred into a table in the SQLite database.

In [17]:
cursor = conn.cursor()

# show table schema
query = "PRAGMA table_info(nominations)"
cursor.execute(query)
pprint(cursor.fetchall())

[(0, 'Year', 'INTEGER', 0, None, 0),
 (1, 'Category', 'TEXT', 0, None, 0),
 (2, 'Nominee', 'TEXT', 0, None, 0),
 (3, 'Won', 'INTEGER', 0, None, 0),
 (4, 'Movie', 'TEXT', 0, None, 0),
 (5, 'Character', 'TEXT', 0, None, 0)]


In [18]:
# show first ten rows of table in the SQLite database
query = "SELECT * FROM nominations LIMIT 10"

cursor.execute(query)
pprint(cursor.fetchall())

# close connection to database
conn.close()

[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0, 'Biutiful', 'Uxbal'),
 (2010,
  'Actor -- Leading Role',
  'Jeff Bridges',
  0,
  'True Grit',
  'Rooster Cogburn'),
 (2010,
  'Actor -- Leading Role',
  'Jesse Eisenberg',
  0,
  'The Social Network',
  'Mark Zuckerberg'),
 (2010,
  'Actor -- Leading Role',
  'Colin Firth',
  1,
  "The King's Speech",
  'King George VI'),
 (2010,
  'Actor -- Leading Role',
  'James Franco',
  0,
  '127 Hours',
  'Aron Ralston'),
 (2010,
  'Actor -- Supporting Role',
  'Christian Bale',
  1,
  'The Fighter',
  'Dicky Eklund'),
 (2010,
  'Actor -- Supporting Role',
  'John Hawkes',
  0,
  "Winter's Bone",
  'Teardrop'),
 (2010,
  'Actor -- Supporting Role',
  'Jeremy Renner',
  0,
  'The Town',
  'James Coughlin'),
 (2010,
  'Actor -- Supporting Role',
  'Mark Ruffalo',
  0,
  'The Kids Are All Right',
  'Paul'),
 (2010,
  'Actor -- Supporting Role',
  'Geoffrey Rush',
  0,
  "The King's Speech",
  'Lionel Logue')]
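
Beyond inspecting individual rows, aggregate queries can serve as a sanity check. This is a minimal sketch, assuming an in-memory database filled with three of the rows shown above rather than the real `nominations.db`:

```python
import sqlite3

# In-memory stand-in for nominations.db (three rows copied from the output above)
rows = [
    (2010, "Actor -- Leading Role", "Javier Bardem", 0, "Biutiful", "Uxbal"),
    (2010, "Actor -- Leading Role", "Colin Firth", 1, "The King's Speech", "King George VI"),
    (2010, "Actor -- Supporting Role", "Christian Bale", 1, "The Fighter", "Dicky Eklund"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE nominations (
        "Year" INTEGER, "Category" TEXT, "Nominee" TEXT,
        "Won" INTEGER, "Movie" TEXT, "Character" TEXT
    )
""")
conn.executemany("INSERT INTO nominations VALUES (?, ?, ?, ?, ?, ?)", rows)

# The row count should match the length of the exported DataFrame
total = conn.execute("SELECT COUNT(*) FROM nominations").fetchone()[0]

# "Won" holds 0/1, so SUM gives the number of winners per category
wins = conn.execute(
    "SELECT Category, SUM(Won) FROM nominations GROUP BY Category ORDER BY Category"
).fetchall()

conn.close()
print(total)
print(wins)
```

Against the real table, `total` could be compared with `len(final_nominations)`.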


# 7. Further questions

The following are further questions put forward by DataQuest. I will not apply them to the current output; I will only try to answer them.

## 7.1. The awards categories in older ceremonies were different than the ones we have today. What relevant information should we keep from older ceremonies?

First, I will list the categories that appear only in the pre-2001 records.

In [27]:
# get categories of pre-2001 ceremonies
cat_pre2001 = set(df[df["Year"] < 2001]["Category"].unique().tolist())

# get categories of 2001-2010 ceremonies
cat_2001_2010 = set(df[df["Year"] >= 2001]["Category"].unique().tolist())

# get categories existing only in pre-2001 ceremonies
cat_pre2001_only = cat_pre2001 - cat_2001_2010
pprint(cat_pre2001_only)

{'Acting (other)',
 'Assistant Director (archaic category)',
 'Dance Direction (archaic category)',
 'Documentary (other)',
 'Engineering Effects (archaic category)',
 'Special Achievement Award',
 'Special Effects (archaic category)',
 'Unique and Artistic Picture (archaic category)'}


Now, up to two rows per category will be displayed for these pre-2001-only categories: one row with a non-null "Additional Info" value and one with a null value, where such rows exist.

In [26]:
df_pre2011_only = df[df["Category"].isin(cat_pre2001_only)][["Year", "Category", "Nominee", "Additional Info"]]

# get all categories
cats = df_pre2011_only["Category"].value_counts().index

# display one null and one non-null rows from each category
for i in cats:
    df1 = df[(df["Category"] == i) & (pd.notna(df["Additional Info"]))]\
          [["Year", "Category", "Nominee", "Additional Info"]].head(1)
    df2 = df[(df["Category"] == i) & (pd.isnull(df["Additional Info"]))]\
          [["Year", "Category", "Nominee", "Additional Info"]].head(1)
    display(pd.concat([df1, df2]))

Unnamed: 0,Year,Category,Nominee,Additional Info
6075,1962,Special Effects (archaic category),The Longest Day,Visual Effects by Robert MacDonald; Audible Ef...
9442,1938,Special Effects (archaic category),For outstanding achievement in creating Specia...,


Unnamed: 0,Year,Category,Nominee,Additional Info
9566,1937,Assistant Director (archaic category),In Old Chicago,Robert Webb
9903,1932,Assistant Director (archaic category),Percy Ikerd (Fox),


Unnamed: 0,Year,Category,Nominee,Additional Info
8657,1942,Documentary (other),"Africa, Prelude to Victory",The March of Time


Unnamed: 0,Year,Category,Nominee,Additional Info
9572,1937,Dance Direction (archaic category),"Bobby Connolly -- Too Marvelous for Words"" num...","Willing and Able"""
9571,1937,Dance Direction (archaic category),"Busby Berkeley -- The Finale"" number from Vars...",


Unnamed: 0,Year,Category,Nominee,Additional Info
2721,1990,Special Achievement Award,Total Recall,"Eric Brevig, Rob Bottin, Tim McGovern, Alex Funke"
2081,1995,Special Achievement Award,"To John Lasseter, for his inspired leadership ...",


Unnamed: 0,Year,Category,Nominee,Additional Info
6226,1960,Acting (other),"To Hayley Mills for Pollyanna, the most outsta...",


Unnamed: 0,Year,Category,Nominee,Additional Info
10134,1927,Unique and Artistic Picture (archaic category),Fox,Sunrise


Unnamed: 0,Year,Category,Nominee,Additional Info
10132,1927,Engineering Effects (archaic category),Roy Pomeroy,Wings
10131,1927,Engineering Effects (archaic category),Ralph Hammeras [NOTE: This nomination was not ...,


As this project focuses on awards for acting, I consider only the "Acting (other)" category useful. Let's take a look at all of its rows.

<a name="act_other_table"></a>

In [30]:
df_pre2011_only[df_pre2011_only["Category"] == "Acting (other)"]

Unnamed: 0,Year,Category,Nominee,Additional Info
6226,1960,Acting (other),"To Hayley Mills for Pollyanna, the most outsta...",
6966,1954,Acting (other),To Jon Whiteley for his outstanding juvenile p...,
6967,1954,Acting (other),To Vincent Winter for his outstanding juvenile...,
7623,1949,Acting (other),"To Bobby Driscoll, as the outstanding juvenile...",
7745,1948,Acting (other),"To Ivan Jandl, for the outstanding juvenile pe...",
7860,1947,Acting (other),To James Baskett for his able and heart-warmin...,
7974,1946,Acting (other),To Harold Russell for bringing hope and courag...,
7975,1946,Acting (other),"To Claude Jarman, Jr., outstanding child actor...",
8089,1945,Acting (other),"To Peggy Ann Garner, outstanding child actress...",
8252,1944,Acting (other),"To Margaret O'Brien, outstanding child actress...",


Unlike the other acting categories, these rows have an empty "Additional Info" column, so it holds neither movie titles nor character names.

Below is the full list of "Nominee" values for the "Acting (other)" category.

<a name="act_other_nominee"></a>

In [34]:
for i, s in df_pre2011_only[df_pre2011_only["Category"] == "Acting (other)"].iterrows():
    print(s["Nominee"])

To Hayley Mills for Pollyanna, the most outstanding juvenile performance during 1960.
To Jon Whiteley for his outstanding juvenile performance in The Little Kidnappers.
To Vincent Winter for his outstanding juvenile performance in The Little Kidnappers.
To Bobby Driscoll, as the outstanding juvenile actor of 1949.
To Ivan Jandl, for the outstanding juvenile performance of 1948, as Karel Malik" in The Search."
To James Baskett for his able and heart-warming characterization of Uncle Remus, friend and story teller to the children of the world in Walt Disney's Song of the South.
To Harold Russell for bringing hope and courage to his fellow veterans through his appearance in The Best Years of Our Lives.
To Claude Jarman, Jr., outstanding child actor of 1946.
To Peggy Ann Garner, outstanding child actress of 1945.
To Margaret O'Brien, outstanding child actress of 1944.
To Judy Garland for her outstanding performance as a screen juvenile during the past year.
To Deanna Durbin and Mickey Roon

What we could add to the current project (which I will not) are the movie titles included in some rows and the character names included in very few.
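
If the movie titles were to be recovered, one possible approach is a regular expression over the "Nominee" text. The sketch below is only a heuristic: it assumes the title follows the word "in" and runs to the final period. `extract_title` is a hypothetical helper, and it misses titles cited in other positions (for example "for Pollyanna").

```python
import re

# Heuristic: "... in <Capitalized title>." at the end of the citation
TITLE_PATTERN = re.compile(r"\bin ([A-Z][^.]*)\.\s*$")

def extract_title(citation):
    """Return the trailing movie title, or None when the heuristic finds none."""
    match = TITLE_PATTERN.search(citation)
    return match.group(1) if match else None

citations = [
    "To Jon Whiteley for his outstanding juvenile performance in The Little Kidnappers.",
    "To Bobby Driscoll, as the outstanding juvenile actor of 1949.",
]
for text in citations:
    print(extract_title(text))
```

Any value recovered this way would still need manual review, because the citations were written as free text.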

## 7.2. What are all the different formatting styles that the Additional Info column contains? Can we use tools like regular expressions to capture these patterns and clean them up?

Up to two rows per category are displayed below: one row with a non-null "Additional Info" value and one with a null value, where such rows exist.

In [63]:
# get all categories
cats = df["Category"].value_counts().index

# display one null and one non-null rows from each category
for i in cats:
    df1 = df[(df["Category"] == i) & (pd.notna(df["Additional Info"]))]\
          [["Year", "Category", "Additional Info"]].head(1)
    df2 = df[(df["Category"] == i) & (pd.isnull(df["Additional Info"]))]\
          [["Year", "Category", "Additional Info"]].head(1)
    display(pd.concat([df1, df2]))

Unnamed: 0,Year,Category,Additional Info
110,2010,Writing,Screenplay by Danny Boyle & Simon Beaufoy
10126,1927,Writing,


Unnamed: 0,Year,Category,Additional Info
66,2010,Music (Scoring),John Powell


Unnamed: 0,Year,Category,Additional Info
28,2010,Cinematography,Matthew Libatique
9342,1938,Cinematography,


Unnamed: 0,Year,Category,Additional Info
23,2010,Art Direction,Production Design: Robert Stromberg; Set Decor...
8260,1944,Art Direction,


Unnamed: 0,Year,Category,Additional Info
75,2010,Best Picture,"Mike Medavoy, Brian Oliver and Scott Franklin,..."


Unnamed: 0,Year,Category,Additional Info
95,2010,Sound,"Lora Hirschberg, Gary A. Rizzo and Ed Novick"
9954,1931,Sound,


Unnamed: 0,Year,Category,Additional Info
90,2010,Short Film (Live Action),Tanel Toom


Unnamed: 0,Year,Category,Additional Info
132,2010,Scientific and Technical (Technical Achievemen...,was shared with the industry in their technic...
128,2010,Scientific and Technical (Technical Achievemen...,


Unnamed: 0,Year,Category,Additional Info
71,2010,Music (Song),"Music and Lyric by Tom Douglas, Troy Verges an..."


Unnamed: 0,Year,Category,Additional Info
10,2010,Actress -- Leading Role,The Kids Are All Right {'Nic'}


Unnamed: 0,Year,Category,Additional Info
38,2010,Directing,Darren Aronofsky
10116,1927,Directing,


Unnamed: 0,Year,Category,Additional Info
0,2010,Actor -- Leading Role,Biutiful {'Uxbal'}
10100,1927,Actor -- Leading Role,


Unnamed: 0,Year,Category,Additional Info
53,2010,Film Editing,Andrew Weisblum


Unnamed: 0,Year,Category,Additional Info
33,2010,Costume Design,Colleen Atwood


Unnamed: 0,Year,Category,Additional Info
15,2010,Actress -- Supporting Role,The Fighter {'Charlene Fleming'}


Unnamed: 0,Year,Category,Additional Info
5,2010,Actor -- Supporting Role,The Fighter {'Dicky Eklund'}


Unnamed: 0,Year,Category,Additional Info
85,2010,Short Film (Animated),Teddy Newton


Unnamed: 0,Year,Category,Additional Info
48,2010,Documentary (Short Subject),Jed Rothstein


Unnamed: 0,Year,Category,Additional Info
43,2010,Documentary (Feature),Banksy and Jaimie D'Cruz
8847,1941,Documentary (Feature),


Unnamed: 0,Year,Category,Additional Info
58,2010,Foreign Language Film,Mexico
6879,1955,Foreign Language Film,


Unnamed: 0,Year,Category,Additional Info
764,2005,Scientific and Technical (Scientific and Engin...,providing the key in demonstrating to the ind...
124,2010,Scientific and Technical (Scientific and Engin...,


Unnamed: 0,Year,Category,Additional Info
6669,1957,Honorary Award,motion picture pioneer
120,2010,Honorary Award,


Unnamed: 0,Year,Category,Additional Info
105,2010,Visual Effects,"Ken Ralston, David Schaub, Carey Villegas and ..."


Unnamed: 0,Year,Category,Additional Info
100,2010,Sound Editing,Richard King


Unnamed: 0,Year,Category,Additional Info
6075,1962,Special Effects (archaic category),Visual Effects by Robert MacDonald; Audible Ef...
9442,1938,Special Effects (archaic category),


Unnamed: 0,Year,Category,Additional Info
63,2010,Makeup,Adrien Morot
5267,1968,Makeup,


Unnamed: 0,Year,Category,Additional Info
507,2007,Scientific and Technical (Academy Award of Merit),


Unnamed: 0,Year,Category,Additional Info
123,2010,Irving G. Thalberg Memorial Award,


Unnamed: 0,Year,Category,Additional Info
20,2010,Animated Feature Film,Chris Sanders and Dean DeBlois
1987,1995,Animated Feature Film,


Unnamed: 0,Year,Category,Additional Info
9566,1937,Assistant Director (archaic category),Robert Webb
9903,1932,Assistant Director (archaic category),


Unnamed: 0,Year,Category,Additional Info
134,2010,Scientific and Technical (Bonner Medal),


Unnamed: 0,Year,Category,Additional Info
386,2008,Jean Hersholt Humanitarian Award,


Unnamed: 0,Year,Category,Additional Info
8657,1942,Documentary (other),The March of Time


Unnamed: 0,Year,Category,Additional Info
9572,1937,Dance Direction (archaic category),"Willing and Able"""
9571,1937,Dance Direction (archaic category),


Unnamed: 0,Year,Category,Additional Info
391,2008,Scientific and Technical (Gordon E. Sawyer Award),


Unnamed: 0,Year,Category,Additional Info
2721,1990,Special Achievement Award,"Eric Brevig, Rob Bottin, Tim McGovern, Alex Funke"
2081,1995,Special Achievement Award,


Unnamed: 0,Year,Category,Additional Info
1293,2001,Scientific and Technical (Special Awards),"first published by the ASC in 1930, the Ameri..."
519,2007,Scientific and Technical (Special Awards),


Unnamed: 0,Year,Category,Additional Info
6226,1960,Acting (other),


Unnamed: 0,Year,Category,Additional Info
10132,1927,Engineering Effects (archaic category),Wings
10131,1927,Engineering Effects (archaic category),


Unnamed: 0,Year,Category,Additional Info
10134,1927,Unique and Artistic Picture (archaic category),Sunrise


I will not show how these can be cleaned up, as there are too many categories to address, and each of them may contain its own variations.

Instead, for the next question, I will discuss (not demonstrate) the clean-up for the "Art Direction" category only.

### 7.2.1. The nominations for the Art Direction category have lengthy values for Additional Info. What information is useful and how do we extract it?

First, let's look at a few rows for "Art Direction" category.

In [66]:
df[df["Category"] == "Art Direction"].head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
23,2010,Art Direction,Alice in Wonderland,Production Design: Robert Stromberg; Set Decor...,YES,,,,,,
24,2010,Art Direction,Harry Potter and the Deathly Hallows Part 1,Production Design: Stuart Craig; Set Decoratio...,NO,,,,,,
25,2010,Art Direction,Inception,Production Design: Guy Hendrix Dyas; Set Decor...,NO,,,,,,
26,2010,Art Direction,The King's Speech,Production Design: Eve Stewart; Set Decoration...,NO,,,,,,
27,2010,Art Direction,True Grit,Production Design: Jess Gonchor; Set Decoratio...,NO,,,,,,


OK, now I will list all "Additional Info" values for "Art Direction" category.

In [94]:
for i, s in df[df["Category"] == "Art Direction"].iterrows():
    print(s["Additional Info"])

Production Design: Robert Stromberg; Set Decoration: Karen O'Hara
Production Design: Stuart Craig; Set Decoration: Stephenie McMillan
Production Design: Guy Hendrix Dyas; Set Decoration: Larry Dias and Doug Mowat
Production Design: Eve Stewart; Set Decoration: Judy Farr
Production Design: Jess Gonchor; Set Decoration: Nancy Haigh
Production Design: Rick Carter and Robert Stromberg; Set Decoration: Kim Sinclair
Production Design: Dave Warren and Anastasia Masaro; Set Decoration: Caroline Smith
Production Design: John Myhre; Set Decoration: Gordon Sim
Production Design: Sarah Greenwood; Set Decoration: Katie Spencer
Production Design: Patrice Vermette; Set Decoration: Maggie Gray
Art Direction: James J. Murakami; Set Decoration: Gary Fettis
Art Direction: Donald Graham Burt; Set Decoration: Victor J. Zolfo
Art Direction: Nathan Crowley; Set Decoration: Peter Lando
Art Direction: Michael Carlin; Set Decoration: Rebecca Alleway
Art Direction: Kristi Zea; Set Decoration: Debra Schutt
Art Di

The values follow an irregular pattern. Some contain only names (e.g. "William S. Darling, David Hall"), whereas others contain a field name followed by the nominee's name (e.g. "Production Design: Robert Stromberg; Set Decoration: Karen O'Hara"). I will demonstrate parsing one value of the latter kind.

In [96]:
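import re

add_info_ad = df[df["Category"] == "Art Direction"]["Additional Info"].iloc[0]

# split the value into "field: name" parts
add_info_ad_split = re.split(r"; ", add_info_ad)

# split each part into field name and nominee name
for i in add_info_ad_split:
    print(re.split(r": ", i))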

['Production Design', 'Robert Stromberg']
['Set Decoration', "Karen O'Hara"]
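
The same idea can be wrapped in a small function that also copes with values that contain only names. This is a sketch, assuming a hypothetical helper `parse_credits` that uses `None` as the key when no field name is present:

```python
import re

def parse_credits(value):
    """Split 'Field: names; Field: names' into a dict of field -> names."""
    credits = {}
    for part in re.split(r";\s*", value):
        if ": " in part:
            field, names = part.split(": ", 1)
        else:
            # No field label, e.g. "William S. Darling, David Hall"
            field, names = None, part
        credits[field] = names.strip()
    return credits

print(parse_credits("Production Design: Robert Stromberg; Set Decoration: Karen O'Hara"))
print(parse_credits("William S. Darling, David Hall"))
```

Other formatting variants in the column may need additional rules; only the two shapes mentioned above are handled here.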


## 7.3. Many values in Additional Info don't contain the character name the actor or actress played. Should we toss out character name altogether as we expand our data? What tradeoffs do we make by doing so?

I will first check if character names are missing in any rows for "Additional Info" column.

In [57]:
import re

# 1. get acting-related rows from original dataset
#    ("act" also matches "Short Film (Live Action)", so "action" is excluded)
df_act = df[(df["Category"].str.contains("act", case=False)) &
            (~df["Category"].str.contains("action", case=False))]


# 2. check if any rows are missing character names

add_info_atypical = []

# go through non-null "Additional Info" values of acting-related rows
for sval in df_act["Additional Info"].dropna():
    # strip "'}" and split by " {'"; a typical value yields [movie, character]
    sval_split = list(filter(None, re.split(" {'", re.sub("'}", "", sval))))

    # collect values that do not split into exactly two parts
    if len(sval_split) != 2:
        add_info_atypical.append(sval)

display(add_info_atypical)

["The Big Pond {'Pierre Mirande'}; and The Love Parade {'Count Alfred Renard'}",
 "Bulldog Drummond {'Hugh 'Bulldog' Drummond'}; and Condemned {'Michel'}",
 "Anna Christie {'Anna Christie'}; and Romance {'Madame Rita Cavallini'}",
 "The Noose {'Nickie Elkins'}; and The Patent Leather Kid {'The Patent Leather Kid'}",
 "The Last Command {'General Dolgorucki [Grand Duke Sergius Alexander]'}; and The Way of All Flesh {'August Schilling'}",
 "7th Heaven {'Diane'}; Street Angel {'Angela'}; and Sunrise {'The Wife'}"]

Some rows have atypical values because they list more than one film per nomination, but none of them is missing a movie title or a character name.
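
One way to handle such multi-film values is to split on the `; ` and `; and ` separators first and then apply the single-film pattern to each part. The sketch below uses a hypothetical helper `parse_roles` and assumes that no title or character name itself contains `; `.

```python
import re

# Separator between films: "; " optionally followed by "and "
SEPARATOR = re.compile(r"; (?:and )?")
# Single film: title, a space, then the character name wrapped in {' ... '}
ROLE = re.compile(r"(.+) \{'(.+)'\}")

def parse_roles(value):
    """Return a list of (movie, character) pairs found in one Additional Info value."""
    roles = []
    for part in SEPARATOR.split(value):
        match = ROLE.fullmatch(part)
        if match:
            roles.append(match.groups())
    return roles

print(parse_roles("The Big Pond {'Pierre Mirande'}; and The Love Parade {'Count Alfred Renard'}"))
print(parse_roles("7th Heaven {'Diane'}; Street Angel {'Angela'}; and Sunrise {'The Wife'}"))
```

Each nomination could then be stored as several rows, one per film, if the data were expanded to older ceremonies.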

Next, I will check the rows where "Additional Info" has null values.

In [58]:
display(df_act[df_act["Additional Info"].isna()].drop(labels=['Won?', 
                                              'Unnamed: 5',
                                              'Unnamed: 6',
                                              'Unnamed: 7',
                                              'Unnamed: 8',
                                              'Unnamed: 9',
                                              'Unnamed: 10'],
                                             axis=1))

Unnamed: 0,Year,Category,Nominee,Additional Info
6226,1960,Acting (other),"To Hayley Mills for Pollyanna, the most outsta...",
6966,1954,Acting (other),To Jon Whiteley for his outstanding juvenile p...,
6967,1954,Acting (other),To Vincent Winter for his outstanding juvenile...,
7623,1949,Acting (other),"To Bobby Driscoll, as the outstanding juvenile...",
7745,1948,Acting (other),"To Ivan Jandl, for the outstanding juvenile pe...",
7860,1947,Acting (other),To James Baskett for his able and heart-warmin...,
7974,1946,Acting (other),To Harold Russell for bringing hope and courag...,
7975,1946,Acting (other),"To Claude Jarman, Jr., outstanding child actor...",
8089,1945,Acting (other),"To Peggy Ann Garner, outstanding child actress...",
8252,1944,Acting (other),"To Margaret O'Brien, outstanding child actress...",


This is almost identical to the [table](#act_other_table) and [values](#act_other_nominee) we saw above. As stated there, movie titles could be extracted from the "Nominee" column in some rows. Character names are found more rarely.

The trade-off of tossing out the character names is that records can no longer be looked up by character name.

## 7.4. What's the best way to handle awards ceremonies that included movies from 2 years (e.g. "1927/28 (1st)")?

I would take the first year. This is an arbitrary choice and not important in itself.

It would be important, though, to apply it consistently.
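
Whichever year is chosen, the rule can be made explicit in code. This is a minimal sketch, assuming a hypothetical helper `first_year` that keeps the first four-digit year of a ceremony label:

```python
import re

def first_year(label):
    """Return the first four-digit year in a ceremony label as an int."""
    match = re.match(r"\s*(\d{4})", label)
    if match is None:
        raise ValueError(f"no year found in {label!r}")
    return int(match.group(1))

print(first_year("2010 (83rd)"))    # 2010
print(first_year("1927/28 (1st)"))  # 1927
```

This gives the same result as the `str.slice(start=0, stop=4)` call in section 2.1, but it fails loudly if a label does not start with a year.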