# Introduction
In this project we clean a CSV dataset of Academy Award nominations and then store the cleaned data in a SQLite database. The data file has the following columns.

- Year - the year of the awards ceremony.
- Category - the category of award the nominee was nominated for.
- Nominee - the person nominated for the award.
- Additional Info - this column contains additional info like:
    - the movie the nominee participated in.
    - the character the nominee played (for acting awards).
- Won? - this column contains either YES or NO depending on if the nominee won the award.


In [61]:
# reading the datafile
import pandas as pd
data = pd.read_csv('academy_awards.csv',encoding='ISO-8859-1')
data.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010 (83rd),Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010 (83rd),Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


The unnamed columns show only NaN values in the first few rows. Let's check how many distinct values each of them holds across the whole dataset.

In [62]:
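# collect the placeholder columns whose names start with 'Unnamed'
unnamed = [c for c in data.columns if c.startswith('Unnamed')]
for u in unnamed: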
    print('|',len(data[u].unique()),'|')
    print('-----')
print(data.shape)

| 6 |
-----
| 5 |
-----
| 4 |
-----
| 3 |
-----
| 2 |
-----
| 2 |
-----
(10137, 11)
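
Counting distinct values tells us that a few cells are not NaN, but not how many. A minimal sketch of a more direct check is shown below. It runs on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the real file) and counts the non-missing cells per placeholder column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration).
toy = pd.DataFrame({
    "Year": ["2010 (83rd)", "2009 (82nd)", "2008 (81st)"],
    "Unnamed: 5": [np.nan, "stray text", np.nan],
    "Unnamed: 6": [np.nan, np.nan, np.nan],
})

# Select the placeholder columns by their name prefix.
unnamed_cols = [c for c in toy.columns if c.startswith("Unnamed")]

# Number of non-missing cells in each placeholder column.
non_null_counts = toy[unnamed_cols].notna().sum()
print(non_null_counts)

# Columns that contain nothing but NaN.
all_nan_cols = [c for c in unnamed_cols if toy[c].isna().all()]
print(all_nan_cols)
```

If the counts are tiny compared with the number of rows, dropping these columns loses very little.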


Let's first clean the `'Year'` column so that it holds only the 4-digit year as an integer.

In [63]:
# keep only the first four characters, i.e. the year without the "(83rd)" suffix
data['Year'] = data['Year'].str[0:4]

In [64]:
data['Year'].head()

0    2010
1    2010
2    2010
3    2010
4    2010
Name: Year, dtype: object

In [65]:
data['Year'] = data['Year'].astype(int)
data['Year'].head()

0    2010
1    2010
2    2010
3    2010
4    2010
Name: Year, dtype: int64
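
Slicing the first four characters works when every value starts with the year. As an alternative sketch, a regular expression can pull out the first run of four digits, which also tolerates stray leading whitespace. The sample strings below are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical raw values; the last one has a leading space on purpose.
years_raw = pd.Series(["2010 (83rd)", "1927/28 (1st)", " 2001 (74th)"])

# expand=False returns a Series holding the first capture group.
years = years_raw.str.extract(r"(\d{4})", expand=False).astype(int)
print(years.tolist())
```

Rows without any four-digit run would come back as NaN, and `astype(int)` would then raise an error, which makes such rows easy to notice.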

Select only the awards given after 2000.

In [66]:
later_than_2000 = data[data['Year']>2000]
later_than_2000.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


In [67]:
# the first four categories, in order of appearance, are the acting awards
award_categories = later_than_2000['Category'].unique()[:4]
print(award_categories,len(award_categories))

['Actor -- Leading Role' 'Actor -- Supporting Role'
 'Actress -- Leading Role' 'Actress -- Supporting Role'] 4


In [68]:
nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)].copy()
nominations.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,
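
Taking `unique()[:4]` depends on the order in which the categories appear in the file. A small sketch of a variant that names the four acting categories explicitly and applies both filters in one step is given below. The `toy` frame is an assumption for illustration, and "Art Direction" is only a stand-in for any non-acting category.

```python
import pandas as pd

# Toy rows; only the category strings for the acting awards come from the output above.
toy = pd.DataFrame({
    "Year": [2010, 2010, 1999, 2005],
    "Category": [
        "Actor -- Leading Role",
        "Art Direction",
        "Actress -- Leading Role",
        "Actress -- Supporting Role",
    ],
})

acting_categories = [
    "Actor -- Leading Role",
    "Actor -- Supporting Role",
    "Actress -- Leading Role",
    "Actress -- Supporting Role",
]

# Combine the year filter and the category filter into one boolean mask.
mask = (toy["Year"] > 2000) & toy["Category"].isin(acting_categories)

# .copy() gives an independent frame, so later column assignments do not warn.
subset = toy[mask].copy()
print(subset)
```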


Now let's replace the `'Won?'` column values with 1 for YES and 0 for NO.<br>
We also copy the result into a new column named `'Won'`, which will take the place of `'Won?'`.

In [69]:
replace_dict = {'YES':1,'NO':0}
nominations['Won?'] = nominations['Won?'].map(replace_dict)
nominations['Won'] = nominations['Won?']
nominations.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Won
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},0,,,,,,,0
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},0,,,,,,,0
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},0,,,,,,,0
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},1,,,,,,,1
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},0,,,,,,,0


In [70]:
print(nominations['Won'].value_counts())

0    160
1     40
Name: Won, dtype: int64
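
The cell above keeps `'Won?'` and adds a copy called `'Won'`, so the old column still has to be dropped later. One possible shortcut, sketched here on made-up rows (the nominee names `A`, `B`, `C` are placeholders), is to rename the column first and then map the values in place.

```python
import pandas as pd

# Toy rows with placeholder nominee names.
toy = pd.DataFrame({"Nominee": ["A", "B", "C"], "Won?": ["NO", "YES", "NO"]})

# Rename the column, then convert YES/NO to 1/0.
toy = toy.rename(columns={"Won?": "Won"})
toy["Won"] = toy["Won"].map({"YES": 1, "NO": 0})
print(toy)
```

Any value that is not in the mapping becomes NaN, which would also turn the column into floats, so it is worth checking `value_counts()` afterwards as done above.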


Now let's drop the columns that carry no useful information (all the unnamed columns and the `'Won?'` column).

In [71]:
nominations.columns[4:11]

Index(['Won?', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Unnamed: 9', 'Unnamed: 10'],
      dtype='object')

In [72]:
dropcol = nominations.columns.values[4:11]
final_nominations = nominations.drop(dropcol,axis=1)
final_nominations.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},0
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},0
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},0
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},1
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},0
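
Dropping by position (`columns[4:11]`) breaks silently if the column order ever changes. A minimal sketch of dropping by name instead is shown below, using a one-row toy frame that is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# One-row toy frame with the same kinds of columns as the real data.
toy = pd.DataFrame({
    "Year": [2010],
    "Won?": [0],
    "Unnamed: 5": [np.nan],
    "Unnamed: 6": [np.nan],
    "Won": [0],
})

# Build the list of columns to drop from their names.
to_drop = [c for c in toy.columns if c.startswith("Unnamed")] + ["Won?"]
cleaned = toy.drop(columns=to_drop)
print(cleaned.columns.tolist())
```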


Now let's clean the `'Additional Info'` column.<br>
In this column the movie name comes first, followed by the character name in braces.

In [73]:
# strip the trailing quote and closing brace; the opening quote after '{' is not removed here
additional_info_one = final_nominations['Additional Info'].str.rstrip("'}")

In [74]:
additional_info_two = additional_info_one.str.split(" {")
print(additional_info_two.head())

0                        [Biutiful, 'Uxbal]
1             [True Grit, 'Rooster Cogburn]
2    [The Social Network, 'Mark Zuckerberg]
3      [The King's Speech, 'King George VI]
4                [127 Hours, 'Aron Ralston]
Name: Additional Info, dtype: object


In [75]:
print(additional_info_two.str[0].head())

0              Biutiful
1             True Grit
2    The Social Network
3     The King's Speech
4             127 Hours
Name: Additional Info, dtype: object


Now let's make two new columns for the movie name and the character name.

In [76]:
final_nominations['Movie'] = additional_info_two.str[0]
final_nominations['Character'] = additional_info_two.str[1]
print(final_nominations[['Movie','Character','Additional Info']].head())

                Movie         Character  \
0            Biutiful            'Uxbal   
1           True Grit  'Rooster Cogburn   
2  The Social Network  'Mark Zuckerberg   
3   The King's Speech   'King George VI   
4           127 Hours     'Aron Ralston   

                          Additional Info  
0                      Biutiful {'Uxbal'}  
1           True Grit {'Rooster Cogburn'}  
2  The Social Network {'Mark Zuckerberg'}  
3    The King's Speech {'King George VI'}  
4              127 Hours {'Aron Ralston'}  
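
Note that the character names above still carry a leading apostrophe, because only the right-hand side of the string was stripped. One way to get clean values, sketched here as an alternative rather than as the approach used in this notebook, is to extract both parts with a single regular expression. The two sample strings are taken from the rows shown above.

```python
import pandas as pd

info = pd.Series([
    "Biutiful {'Uxbal'}",
    "The King's Speech {'King George VI'}",
])

# Named groups become the column names of the resulting DataFrame.
# Movie: everything before " {'"; Character: everything up to the final "'}".
parts = info.str.extract(r"^(?P<Movie>.*?) \{'(?P<Character>.*)'\}$")
print(parts)
```

Rows that do not follow the `Movie {'Character'}` pattern come back as NaN in both columns, so they are easy to find with `parts.isna()`.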


The `'Additional Info'` column is no longer needed, so let's drop it.

In [77]:
final_nominations.drop('Additional Info',axis=1,inplace=True)
final_nominations.head()

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,'Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,'Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,'Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,'King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,'Aron Ralston


In [78]:
print(final_nominations.count())

Year         200
Category     200
Nominee      200
Won          200
Movie        200
Character    200
dtype: int64


Now that we have a cleaned dataset, it's time to save it in a SQLite database.

In [79]:
# connecting to the database nominations.db
import sqlite3
conn = sqlite3.connect('nominations.db')

In [81]:
# adding cleaned dataset to nominations table without index
final_nominations.to_sql('nominations',conn,index=False,if_exists='append')

In [82]:
# checking the table info for all columns
q = 'pragma table_info(nominations)'
print(conn.execute(q).fetchall())

[(0, 'Year', 'INTEGER', 0, None, 0), (1, 'Category', 'TEXT', 0, None, 0), (2, 'Nominee', 'TEXT', 0, None, 0), (3, 'Won', 'REAL', 0, None, 0), (4, 'Movie', 'TEXT', 0, None, 0), (5, 'Character', 'TEXT', 0, None, 0)]
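
The table info lists `Won` as `REAL` even though the DataFrame column is `int64`. One possible explanation, which is an assumption and not something this notebook confirms, is that the table already existed from an earlier run and `if_exists='append'` kept its old schema. The sketch below uses an in-memory database and two toy rows to show that `if_exists='replace'` rebuilds the table from the current dtypes.

```python
import sqlite3

import pandas as pd

# Two toy rows; names are taken from the sample output above.
toy = pd.DataFrame({
    "Year": [2010, 2010],
    "Nominee": ["Javier Bardem", "Colin Firth"],
    "Won": [0, 1],
})

conn = sqlite3.connect(":memory:")

# 'replace' drops any existing table and recreates it from the DataFrame dtypes.
toy.to_sql("nominations", conn, index=False, if_exists="replace")

# Map column name -> declared SQLite type.
schema = conn.execute("PRAGMA table_info(nominations)").fetchall()
col_types = {row[1]: row[2] for row in schema}
print(col_types)

conn.close()
```

With `'append'`, running the notebook twice also inserts every row twice, so `'replace'` may be the safer choice for a table that is always rebuilt from scratch.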


In [83]:
q = 'select * from nominations limit 10;'
conn.execute(q).fetchall()

[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0.0, 'Biutiful', "'Uxbal"),
 (2010,
  'Actor -- Leading Role',
  'Jeff Bridges',
  0.0,
  'True Grit',
  "'Rooster Cogburn"),
 (2010,
  'Actor -- Leading Role',
  'Jesse Eisenberg',
  0.0,
  'The Social Network',
  "'Mark Zuckerberg"),
 (2010,
  'Actor -- Leading Role',
  'Colin Firth',
  1.0,
  "The King's Speech",
  "'King George VI"),
 (2010,
  'Actor -- Leading Role',
  'James Franco',
  0.0,
  '127 Hours',
  "'Aron Ralston"),
 (2010,
  'Actor -- Supporting Role',
  'Christian Bale',
  1.0,
  'The Fighter',
  "'Dicky Eklund"),
 (2010,
  'Actor -- Supporting Role',
  'John Hawkes',
  0.0,
  "Winter's Bone",
  "'Teardrop"),
 (2010,
  'Actor -- Supporting Role',
  'Jeremy Renner',
  0.0,
  'The Town',
  "'James Coughlin"),
 (2010,
  'Actor -- Supporting Role',
  'Mark Ruffalo',
  0.0,
  'The Kids Are All Right',
  "'Paul"),
 (2010,
  'Actor -- Supporting Role',
  'Geoffrey Rush',
  0.0,
  "The King's Speech",
  "'Lionel Logue")]

In [84]:
# now let's close the connection
conn.close()
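
As a closing sketch, the stored rows can be read back into a DataFrame with `pd.read_sql_query`, and `contextlib.closing` makes sure the connection is closed even if a query fails. The in-memory database and the two toy rows are assumptions for illustration.

```python
import sqlite3
from contextlib import closing

import pandas as pd

with closing(sqlite3.connect(":memory:")) as conn:
    # Write two toy rows so there is something to query.
    pd.DataFrame({
        "Nominee": ["Colin Firth", "James Franco"],
        "Won": [1, 0],
    }).to_sql("nominations", conn, index=False)

    # Parameterised query: the ? placeholder is filled from params.
    winners = pd.read_sql_query(
        "SELECT Nominee FROM nominations WHERE Won = ?",
        conn,
        params=(1,),
    )

print(winners)
```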