We'll work with data on Academy Award nominations, which can be downloaded here: https://www.aggdata.com/awards/oscar. 

The Academy Awards categories are described here: https://en.wikipedia.org/wiki/Academy_Awards#Awards_of_Merit_categories

Here are the columns in the dataset, academy_awards.csv:

Year - the year of the awards ceremony.

Category - the category of award the nominee was nominated for.

Nominee - the person nominated for the award.

Additional Info - this column contains additional info like:

    the movie the nominee participated in.

    the character the nominee played (for acting awards).

Won? - this column contains either YES or NO, depending on whether the nominee won the award.

In [158]:
import pandas as pd 

df = pd.read_csv("academy_awards.csv", encoding="ISO-8859-1")

Let's look at the data and see if we can spot any quality issues.

In [159]:
df.head(3)

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,


We can see there are 6 unnamed columns at the end. 

We'll use the value_counts method to check whether any of them hold valid values that we need.

Also notice that the Additional Info column mixes a few different formatting styles.

We will need to clean this column up.
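
As a rough sketch of that check, the snippet below counts the non-null entries in every column whose name starts with "Unnamed". The small `sample` frame and its values are made up for illustration; with the real data the same two lines would be run against `df`.

```python
import pandas as pd

# Toy stand-in for the real data (values are illustrative only).
sample = pd.DataFrame({
    "Won?": ["NO", "YES", "NO"],
    "Unnamed: 5": [None, "*", None],
    "Unnamed: 6": [None, None, None],
})

# Collect the columns pandas auto-named because the CSV header was blank.
unnamed_cols = [col for col in sample.columns if col.startswith("Unnamed")]

# Count how many non-null values each of those columns holds.
non_null_counts = sample[unnamed_cols].notna().sum()
print(non_null_counts)
```

A column with a count of zero holds nothing but nulls and is safe to drop; any column with a positive count deserves a closer look with value_counts.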

In [160]:
df.iloc[:, 5].value_counts()

*                                                                                                               7
 discoverer of stars                                                                                            1
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 resilience                                                                                                     1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
Name: Unnamed: 5, dtype: int64

In [161]:
df.iloc[:, 7].value_counts()

 kindly                                               1
*                                                     1
 while requiring no dangerous solvents. [Systems]"    1
Name: Unnamed: 7, dtype: int64

The dataset is very messy. Most columns don't have consistent formatting, and consistency matters a great deal when we use SQL to query the data later on. Other columns vary in the information they convey depending on the award category the row corresponds to.

<b>Filtering the data</b>

In [162]:
df["Year"].head(2)

0    2010 (83rd)
1    2010 (83rd)
Name: Year, dtype: object

Before we filter the data, let's clean up the Year column by keeping just the first 4 characters of each value, which drops the part in parentheses.

In [163]:
# The output above shows that Year has dtype object; slice out the first 4 characters and convert them to int64 using astype
df["Year"] = df["Year"].str[0:4].astype("int64")

In [164]:
df["Year"].head(2)

0    2010
1    2010
Name: Year, dtype: int64
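
The slicing step can be tried in isolation on a couple of values. This is a minimal sketch on a hand-made Series (the second value is invented for illustration), not the full dataset, and it assumes every value begins with a four-digit year.

```python
import pandas as pd

# Hand-made sample in the same "YYYY (Nth)" shape as the Year column.
years = pd.Series(["2010 (83rd)", "2009 (82nd)"], name="Year")

# Keep the first four characters, then cast the strings to integers.
clean_years = years.str[0:4].astype("int64")
print(clean_years.tolist())
```

If some value did not start with four digits, astype would raise a ValueError, which is a useful early signal that the column needs more cleaning.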

In [165]:
# Select only the rows from the DataFrame where the Year column is larger than 2000
later_than_2000 = df[df["Year"] > 2000]

In [166]:
# Let's look at one of the categories currently in the data, such as Art Direction
later_than_2000["Category"][later_than_2000["Category"] == "Art Direction"].head(3)

23    Art Direction
24    Art Direction
25    Art Direction
Name: Category, dtype: object

In [167]:
award_categories = ["Actor -- Leading Role", "Actor -- Supporting Role", "Actress -- Leading Role", "Actress -- Supporting Role"]

In [168]:
# Select only the rows where the Category matches one of the 4 awards we're interested in
nominations = later_than_2000[later_than_2000["Category"].isin(award_categories)]

In [169]:
# nominations no longer contains the Art Direction category
nominations["Category"][nominations["Category"]=="Art Direction"].head(3)

Series([], Name: Category, dtype: object)
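
The year filter and the category filter can also be combined into a single boolean mask. The sketch below uses a toy frame in place of `df` (its rows are invented for illustration) and is only one possible way to write the same selection.

```python
import pandas as pd

# Toy stand-in for the cleaned data; rows are illustrative only.
toy = pd.DataFrame({
    "Year": [2010, 2010, 1999],
    "Category": [
        "Actor -- Leading Role",
        "Art Direction",
        "Actress -- Leading Role",
    ],
})

award_categories = [
    "Actor -- Leading Role",
    "Actor -- Supporting Role",
    "Actress -- Leading Role",
    "Actress -- Supporting Role",
]

# Both conditions must hold: a recent year AND one of the four acting categories.
mask = (toy["Year"] > 2000) & toy["Category"].isin(award_categories)
filtered = toy[mask]
print(filtered)
```

The parentheses around the year comparison are required, because `&` binds more tightly than `>` in Python.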

Next, we'll convert the Won? column from YES/NO to 1/0 and rename it to Won so that it's consistent with the other column names. Finally, we'll get rid of the 6 extra unnamed columns, since they contain only null values in our filtered DataFrame nominations.

In [170]:
nominations['Won?'].head(4)

0     NO
1     NO
2     NO
3    YES
Name: Won?, dtype: object

In [171]:
nominations.iloc[:, 5].value_counts()
nominations.iloc[:, 7].value_counts()

Series([], Name: Unnamed: 7, dtype: int64)

In [172]:
# The 6 extra unnamed columns contain only null values in the filtered DataFrame nominations, so let's get rid of them
nominations.iloc[:, 9].value_counts()

Series([], Name: Unnamed: 9, dtype: int64)

The "Won?" column holds either YES or NO; let's convert those values to 1 and 0.
We will also rename the Won? column to Won so that it's consistent with the other column names.

In [173]:
# Use the Series method map to replace all NO values with 0 and all YES values with 1
replace_d = {'YES': 1, 'NO': 0}
# Reassign the mapped values back to the column
nominations['Won?'] = nominations['Won?'].map(replace_d)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
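
The warning above appears because nominations was created by filtering another DataFrame, so pandas cannot be sure whether the assignment is acting on a copy or on the original data. One common way to avoid it, sketched here on a toy frame rather than the notebook's actual data, is to take an explicit .copy() of the filtered rows before assigning to them.

```python
import pandas as pd

# Toy stand-in for the data; rows are illustrative only.
toy = pd.DataFrame({
    "Year": [2010, 2010, 1999],
    "Won?": ["NO", "YES", "NO"],
})

# .copy() makes the filtered frame independent of `toy`.
recent = toy[toy["Year"] > 2000].copy()

# Map YES/NO to 1/0 on the independent copy.
replace_d = {"YES": 1, "NO": 0}
recent["Won?"] = recent["Won?"].map(replace_d)
print(recent)
```

The original `toy` frame is left untouched, which makes it explicit that only the filtered copy is being modified.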


In [174]:
# Verify the conversion
nominations['Won?'].head(4)

0    0
1    0
2    0
3    1
Name: Won?, dtype: int64

In [175]:
nominations['Won'] = nominations['Won?']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [176]:
nominations['Won'].head(4)

0    0
1    0
2    0
3    1
Name: Won, dtype: int64

In [177]:
# Now let's drop the extraneous columns
drop_cols = ["Won?", "Unnamed: 5", "Unnamed: 6", "Unnamed: 7", "Unnamed: 8", "Unnamed: 9", "Unnamed: 10"]
final_nominations = nominations.drop(drop_cols, axis=1)
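
As an alternative to copying the column and then dropping the original, the rename and the drop can be chained. This is only a sketch on a toy frame (the column selection by name prefix is our own shortcut, not something the notebook does), but it produces the same shape of result.

```python
import pandas as pd

# Toy stand-in with the same kinds of columns as nominations.
toy = pd.DataFrame({
    "Nominee": ["Javier Bardem", "Jeff Bridges"],
    "Won?": [0, 0],
    "Unnamed: 5": [None, None],
    "Unnamed: 6": [None, None],
})

# Find the auto-named columns instead of listing each one by hand.
unnamed_cols = [col for col in toy.columns if col.startswith("Unnamed")]

# Rename Won? to Won and drop the empty columns in one chained expression.
tidy = toy.rename(columns={"Won?": "Won"}).drop(columns=unnamed_cols)
print(tidy.columns.tolist())
```

Both rename and drop return new DataFrames by default, so `toy` itself is not changed.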