# APPLIED DATA CLEANING ON KICKSTARTER DATASET

<img src='https://c3.iggcdn.com/indiegogo-media-prod-cld/image/upload/c_fill,w_695,g_auto,q_auto,dpr_2.6,f_auto,h_460/raayulrjgqrecunugw8y' width=600>

In [1]:
# Import pandas
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Load dataset
# df = pd.read_csv('https://www.dropbox.com/s/k0fyjksq5c6cbvx/kickstarter_data.csv?dl=1', index_col=[0])
df = pd.read_csv('data\\kickstarter_data.csv')

# First 5 rows of the dataframe
df.sample(5)

Unnamed: 0.1,Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
194784,194784,1991781833,Under The Radar's Creative ReLaunch,Faith,Music,USD,2016-11-16,13000.0,2016-10-13 16:20:36,16479.77,successful,223,US,3397.0,16479.77,13000.0
195025,195025,1993032191,Meat is Murdered : a Rock Opera,Theater,Theater,USD,2012-03-02,8000.0,2012-01-21 02:13:57,0.0,failed,0,US,0.0,0.0,8000.0
228431,228431,231242167,Virtual Art Gallery,Digital Art,Art,GBP,2016-09-24,400.0,2016-08-25 20:05:41,3.0,failed,1,GB,0.0,3.89,518.5
156580,156580,1796494261,Cornwalls first gourmet takeaway restaurant,Food,Food,GBP,2014-02-06,3500.0,2014-01-07 18:16:55,90.0,failed,6,GB,147.26,146.94,5714.38
209740,209740,2068971849,2Empires the Novel,Fiction,Publishing,USD,2015-06-05,30651.0,2015-05-05 05:04:05,390.0,failed,7,US,390.0,390.0,30651.0


In [2]:
df = df.iloc[:, 1:]

In [3]:
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


**ANNOTATION**

- Question: *graded* exercises to complete for score

- Task: *not graded* exercises, highly recommended to follow through

- Dataframe Columns:
    - `goal`: Goal set at the launched time.

    - `pledge`: Total amount of funding the project successfully called.

    - `backers`: Number of investors that fund the project.

    - `usd pledged`: conversion in US dollars of the pledged column (conversion done by kickstarter).

    - `usd_pledge_real`: conversion in US dollars of the pledged column (conversion from Fixer.io API).

    - `usd_goal_real`: conversion in US dollars of the goal column (conversion from Fixer.io API).

The dataset is acquired from Kaggle.com. You can visit it here: https://www.kaggle.com/kemical/kickstarter-projects

🙋🏻‍♂️ **DISCUSSION :** Discuss with your teammate to:

- Understand the meaning of each column
- Is there any column that you feel unnecessary?



# A. OVERVIEW AND CLEAN

## **A.1** - Remove unwanted observations
---

### Task

We have many columns for the pledge and goal with different conversions. For this analysis, we choose to keep only `usd_pledged_real` and `usd_goal_real`. 

Write one line of code to drop the columns `goal`, `pledge`, `usd pledged`.

In [4]:
df.columns

Index(['ID', 'name', 'category', 'main_category', 'currency', 'deadline',
       'goal', 'launched', 'pledged', 'state', 'backers', 'country',
       'usd pledged', 'usd_pledged_real', 'usd_goal_real'],
      dtype='object')

In [5]:
# YOUR CODE/ANSWER HERE
df.drop(columns = ['goal', 'pledged', 'usd pledged'], inplace = True)

In [6]:
# Check your dataframe again to see if the columns are successfully dropped
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0


For future convenience, let's rename the columns as follows:

- `usd_pledged_real` --> `pledged`
- `usd_goal_real` --> `goal`

Write your code to do that below.

In [7]:
# YOUR CODE HERE
df.rename(columns = {'usd_pledged_real' : 'pledged', 'usd_goal_real': 'goal'}, inplace = True)

In [8]:
# Check your dataframe again to see if your columns are successfully renamed
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0


### Question 1 (5 pts)

Write one line of code to check for duplications (of the whole row). Your code should return only one number, which is the total number of duplicated rows.

In [9]:
# TEST YOUR CODE HERE
df.duplicated().sum()

0

In [10]:
# Check duplicate for ID:
df.ID.duplicated().sum()

0

### Question 2 (5 pts)

How about duplicated values in the column `name`? Which of the following expression gives the number of rows with duplicated names?

<ol type="A">
  <li><code>df['name'].isduplicated().sum()</code></li>
  <li><code>df[df['name'].duplicated()].sum()</code></li>
  <li><code>df['name'].duplicated().sum()</code></li>
  <li><code>df.duplicated().sum()</code></li>
</ol>

In [11]:
# YOUR CODE/ANSWER HERE => C
df.name.duplicated().sum()

2896

### Question 3 (5 pts)

Which of the following expression selects all rows with duplicated names?

<ol type="A">
  <li><code>df(df['name'].duplicated())</code></li>
  <li><code>df[df['name'].duplicated()]</code></li>
  <li><code>df['name'].duplicated()</code></li>
  <li><code>df[df.duplicated()]</code></li>
</ol>

In [12]:
# TEST YOUR CODE HERE => B
df[df.name.duplicated()]

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
6379,1032645935,Cancelled (Canceled),Design,Design,USD,2015-06-05,2015-05-06 12:29:08,canceled,17,US,3105.00,100000.00
7743,1039093987,New EP/Music Development,Music,Music,USD,2016-01-07,2015-11-24 15:29:35,undefined,0,"N,0""",257.00,3800.00
8356,1042208764,The Basement,Horror,Film & Video,USD,2015-05-07,2015-04-07 18:24:19,successful,106,US,12311.00,12000.00
8448,1042642941,The Gift,Film & Video,Film & Video,USD,2013-05-08,2013-04-17 01:55:27,successful,37,US,3370.00,3000.00
8761,1044230780,Redemption,Narrative Film,Film & Video,USD,2012-08-25,2012-06-26 19:13:21,successful,67,US,11440.00,11000.00
...,...,...,...,...,...,...,...,...,...,...,...,...
378140,997542782,Innocent Sin,Indie Rock,Music,USD,2015-01-25,2014-12-26 18:04:28,successful,15,US,600.00,300.00
378224,997919903,Grassroots,Publishing,Publishing,EUR,2017-09-14,2017-08-15 18:36:18,failed,3,IE,25.12,11963.01
378426,998836498,The InAction,Camera Equipment,Technology,USD,2016-07-29,2016-06-28 04:00:08,canceled,5,US,670.00,80000.00
378475,999055513,The Last Hurrah,Rock,Music,USD,2012-06-03,2012-05-04 15:20:41,successful,69,US,7665.00,5500.00


### Task


From duplicated **name**, Let's search for all the rows that have name '**The Gift**'.

In [13]:
# YOUR CODE HERE
df[df.name == 'The Gift']

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
5003,1025568649,The Gift,Shorts,Film & Video,USD,2014-05-10,2014-04-29 03:14:49,successful,27,US,4560.0,4500.0
8448,1042642941,The Gift,Film & Video,Film & Video,USD,2013-05-08,2013-04-17 01:55:27,successful,37,US,3370.0,3000.0
77475,1394078347,The Gift,Shorts,Film & Video,USD,2011-04-09,2011-03-09 00:15:36,failed,0,US,0.0,1500.0
135140,168615922,The Gift,Shorts,Film & Video,USD,2011-12-31,2011-11-01 02:28:03,failed,0,US,0.0,10000.0


## **A.2** - Structural Error, Correct Datatype
---

### Task

Write one line of code to print the overall information of the dataset. Are there any columns that you feel like they have the wrong datatype?

In [14]:
# YOUR CODE HERE
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ID             378661 non-null  int64  
 1   name           378657 non-null  object 
 2   category       378661 non-null  object 
 3   main_category  378661 non-null  object 
 4   currency       378661 non-null  object 
 5   deadline       378661 non-null  object 
 6   launched       378661 non-null  object 
 7   state          378661 non-null  object 
 8   backers        378661 non-null  int64  
 9   country        378661 non-null  object 
 10  pledged        378661 non-null  float64
 11  goal           378661 non-null  float64
dtypes: float64(2), int64(2), object(8)
memory usage: 34.7+ MB


The `launched` and `deadline` should be in `datetime` datatype, so you need to convert them to datetime datatype:

*Hint: pd.to_datetime()*

In [15]:
# Your code here:
df.deadline = pd.to_datetime(df.deadline)
df.launched = pd.to_datetime(df.launched)


Check info one more time to make sure everything goes as plan.

In [16]:
# YOUR CODE HERE
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 12 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   ID             378661 non-null  int64         
 1   name           378657 non-null  object        
 2   category       378661 non-null  object        
 3   main_category  378661 non-null  object        
 4   currency       378661 non-null  object        
 5   deadline       378661 non-null  datetime64[ns]
 6   launched       378661 non-null  datetime64[ns]
 7   state          378661 non-null  object        
 8   backers        378661 non-null  int64         
 9   country        378661 non-null  object        
 10  pledged        378661 non-null  float64       
 11  goal           378661 non-null  float64       
dtypes: datetime64[ns](2), float64(2), int64(2), object(6)
memory usage: 34.7+ MB


In [17]:
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0


## **A.3** - Handling Missing Values
---

### Question 4 (5 pts)

Which of the following expression(s) give the number of null values in *each* column?

<ol type="A">
  <li><code>df.isna().sum()</code></li>
  <li><code>df.null().sum()</code></li>
  <li><code>df.isnull().sum()</code></li>
  <li><code>sum(df.isnull())</code></li>
  <li><code>df.isna.sum()</code></li>
  <li><code>sum(df.isna())</code></li>
</ol>

In [18]:
# TEST YOUR CODE HERE
df.isna().sum()

ID               0
name             4
category         0
main_category    0
currency         0
deadline         0
launched         0
state            0
backers          0
country          0
pledged          0
goal             0
dtype: int64

In [19]:
df[df.name.isna()]

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
166851,1848699072,,Narrative Film,Film & Video,USD,2012-02-29,2012-01-01 12:35:31,failed,1,US,100.0,200000.0
307234,634871725,,Video Games,Games,GBP,2013-01-06,2012-12-19 23:57:48,failed,12,GB,316.05,3224.97
309991,648853978,,Product Design,Design,USD,2016-07-18,2016-06-18 05:01:47,suspended,0,US,0.0,2500.0
338931,796533179,,Painting,Art,USD,2011-12-05,2011-11-06 23:55:55,failed,5,US,220.0,35000.0


### Task

Write one line of code to fill all the `NaN` values in name with `Unknown`.

In [20]:
# YOUR CODE HERE
df.fillna('Unknown', inplace = True)

Check the number of `NaN` value one more time to make sure we cleaned them all.

In [21]:
# YOUR CODE HERE
df.isna().sum()

ID               0
name             0
category         0
main_category    0
currency         0
deadline         0
launched         0
state            0
backers          0
country          0
pledged          0
goal             0
dtype: int64

## **A.4** - Handling errors, corrupted data
---

Scanning through each column to find abnormalities and fix them. Simple as that.

In [22]:
# Display the dataframe one more time.
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0


### Question 5 (5 pts)

Let's start with `category`. Write an expression to display the frequency of the value in the column `category`. (The unique values and how many times they appear)

In [26]:
pd.set_option('display.max_rows', None)     # show full rows
pd.set_option('display.max_columns', None)  # show full columns

In [25]:
# TEST YOUR CODE HERE
df.category.value_counts()

Product Design        22314
Documentary           16139
Music                 15727
Tabletop Games        14180
Shorts                12357
Video Games           11830
Food                  11493
Film & Video          10108
Fiction                9169
Fashion                8554
Nonfiction             8318
Art                    8253
Apparel                7166
Theater                7057
Technology             6930
Rock                   6758
Children's Books       6756
Apps                   6345
Publishing             6018
Webseries              5762
Photography            5752
Indie Rock             5657
Narrative Film         5188
Web                    5153
Comics                 4996
Crafts                 4664
Country & Folk         4451
Design                 4199
Hip-Hop                3912
Hardware               3663
Pop                    3350
Painting               3294
Games                  3226
Illustration           3175
Accessories            3165
Public Art          

### Question 6 (5 pts)

Everything seems fine. We now move on to `main_category`. Write an expression to display the frequency of the value in the column `main_category`.

In [27]:
# TEST YOUR CODE HERE
df.main_category.value_counts()

Film & Video    63585
Music           51918
Publishing      39874
Games           35231
Technology      32569
Design          30070
Art             28153
Food            24602
Fashion         22816
Theater         10913
Comics          10819
Photography     10779
Crafts           8809
Journalism       4755
Dance            3768
Name: main_category, dtype: int64

### Task

Let's do the same for `currency` and `state`. Find anything abnormal?

In [28]:
# YOUR CODE HERE
df.currency.value_counts()

USD    295365
GBP     34132
EUR     17405
CAD     14962
AUD      7950
SEK      1788
MXN      1752
NZD      1475
DKK      1129
CHF       768
NOK       722
HKD       618
SGD       555
JPY        40
Name: currency, dtype: int64

In [29]:
# YOUR CODE HERE
df.state.value_counts()

failed        197719
successful    133956
canceled       38779
undefined       3562
live            2799
suspended       1846
Name: state, dtype: int64

### Question 7 (5 pts)

Are there any abnormalities in the column `country`?

<ol type="A">
  <li>Nope, totally fine.</li>
  <li>There is no project in US.</li>
  <li>There are two different values that both represent Canada.</li>
  <li>There is a weird value called <code>N,0"</code>.</li>
</ol>

D

In [30]:
# TEST YOUR CODE HERE
df.country.value_counts()

US      292627
GB       33672
CA       14756
AU        7839
DE        4171
N,0"      3797
FR        2939
IT        2878
NL        2868
ES        2276
SE        1757
MX        1752
NZ        1447
DK        1113
IE         811
CH         761
NO         708
HK         618
BE         617
AT         597
SG         555
LU          62
JP          40
Name: country, dtype: int64

#### *Click to see my solution*

One way to adjust the error in the column `country` is to refer it with the column `currency`.

For example, if the `currency` is `USD`, we can set the value in `country` to `US`.


### Question 8 (5 pts)

Write an expression to select all rows with that weird value above (`N,0"`).

In [33]:
# TEST YOUR CODE HERE
df[df.country == 'N,0"'].sample(5)

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
99059,1502868750,Send QUEL BORDEL! to France!!,Music,Music,USD,2015-03-18,2015-02-23 20:32:00,undefined,0,"N,0""",5846.0,5000.0
275680,472780736,Midnight Carnival,Film & Video,Film & Video,USD,2015-05-28,2015-04-28 16:13:18,undefined,0,"N,0""",3047.0,2500.0
276801,478264169,White Road,Film & Video,Film & Video,USD,2016-03-06,2016-02-05 13:48:53,undefined,0,"N,0""",5930.0,4500.0
270310,445494223,The Adventures of the Three Little Clouds,Publishing,Publishing,USD,2015-12-09,2015-11-09 05:04:15,undefined,0,"N,0""",0.0,6000.0
62319,1316880473,'The Girl On Christopher Street' official Vide...,Music,Music,GBP,2015-11-03,2015-10-02 10:55:28,undefined,0,"N,0""",5086.44,4541.46


### Question 9 (5 pts)

Write one line of code to return the ***unique currencies*** of the projects that have country as `N,0"`?

In [34]:
# TEST YOUR CODE HERE
df[df.country == 'N,0"']['currency'].unique()

array(['USD', 'AUD', 'CAD', 'GBP', 'EUR', 'SEK', 'DKK', 'NZD', 'NOK',
       'CHF'], dtype=object)

### Task

Our mission is apply a check function onto each row of the country-N0" part.

First, define a function that takes in a whole data row. 

- If currency is `USD` ---> country is `US`
- If currency is `AUD` ---> country is `AU`
- If currency is `CAD` ---> country is `CA`
- If currency is `GBP` ---> country is `GB`
- If currency is `SEK` ---> country is `SE`
- If currency is `DKK` ---> country is `DK`
- If currency is `NZD` ---> country is `NZ`
- If currency is `NOK` ---> country is `NO`
- If currency is `CHF` ---> country is `CH`
- If currency is `EUR` ---> country is `DE`

In the `EUR` case, we choose to replace by the mode --- `DE` (Within projects that in `EUR`, the most are from `DE` -- Germany)

In [35]:
def fix_country(row):
    if row['currency'] == 'EUR':
        return 'DE'
    else:
        return row['currency'][:2]

In [37]:
# Apply and then write it back to the dataframe
df.loc[df.country == 'N,0"', 'country'] = df.apply(fix_country, axis = 1)

In [38]:
# Check the column again to make sure the N0" is gone
df['country'].value_counts()

US    295365
GB     34132
CA     14962
AU      7950
DE      4357
FR      2939
IT      2878
NL      2868
ES      2276
SE      1788
MX      1752
NZ      1475
DK      1129
IE       811
CH       768
NO       722
HK       618
BE       617
AT       597
SG       555
LU        62
JP        40
Name: country, dtype: int64

### Question 10 (5 pts)

Let's move on to the numeric columns.

Write one line of code to give the descriptive statistic review of three columns: `backers`, `pledged`, and `goal`.

In [40]:
# TEST YOUR CODE HERE
df[['backers', 'pledged', 'goal']].describe()

Unnamed: 0,backers,pledged,goal
count,378661.0,378661.0,378661.0
mean,105.617476,9058.924,45454.4
std,907.185035,90973.34,1152950.0
min,0.0,0.0,0.01
25%,2.0,31.0,2000.0
50%,12.0,624.33,5500.0
75%,56.0,4050.0,15500.0
max,219382.0,20338990.0,166361400.0


💡 **Tips:** Your question right now is what the heck is `e+05` and `e+04`. This in Python called scientific style. `e+04` means `*10e4` or `*10000`.

If you don't like it, you can use the syntax below. After you run the code, all the report later will be printed in 2 decimal float format.

In [42]:
pd.options.display.float_format = "{:.2f}".format
# let's run df.describe() again:
df[['backers', 'pledged', 'goal']].describe()

Unnamed: 0,backers,pledged,goal
count,378661.0,378661.0,378661.0
mean,105.62,9058.92,45454.4
std,907.19,90973.34,1152950.06
min,0.0,0.0,0.01
25%,2.0,31.0,2000.0
50%,12.0,624.33,5500.0
75%,56.0,4050.0,15500.0
max,219382.0,20338986.27,166361390.71


Everything seems fine. No projects have abnormality in these numeric columns.

### Question 11 (10 pts)

👑 **The best project** --- Write one line of code to get the row of the project that have the max pledged.

In [43]:
# TEST YOUR CODE HERE
df[df.pledged == df.pledged.max()]

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
157270,1799979574,"Pebble Time - Awesome Smartwatch, No Compromises",Product Design,Design,USD,2015-03-28,2015-02-24 15:44:42,successful,78471,US,20338986.27,500000.0


You have done a lot of coding. Now, take a bit time off, google and read about this awesome product design project called `'Pebble Time - Awesome Smartwatch, No Compromises'` that attracts the most pledge on Kickstarter history. 

### Question 12 (10 pts)

❤️ **The top favorite** --- Write one line of code to get the row of the project that have the max backers.

In [44]:
# TEST YOUR CODE HERE
df[df.backers == df.backers.max()]

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
187652,1955357092,Exploding Kittens,Tabletop Games,Games,USD,2015-02-20,2015-01-20 19:00:19,successful,219382,US,8782571.99,10000.0


Is the product sounds familiar? You can buy this at any convenient store in Vietnam nowaday. 🥳

### Question 13 (10 pts)

🤑 **The most ambitious** --- Write one line of code to get the row of the project that set the max goal.

In [45]:
# TEST YOUR CODE HERE
df[df.goal == df.goal.max()]

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
47803,1243678698,FUCK Potato Salad. Paleo Potato Brownies!,Food,Food,GBP,2014-08-08,2014-07-09 00:24:34,failed,0,GB,0.0,166361390.71


### Finally, the last two columns left are the two datetime `launched` and `deadline`.

### Question 14 (10 pts)

Write one line of code to get the minimum value of the column `launched`.

In [46]:
# TEST YOUR CODE HERE
df.launched.min()

Timestamp('1970-01-01 01:00:00')

### Question 15 (10 pts)

Write one line of code to get the maximum value of the column `launched`.

In [47]:
# TEST YOUR CODE HERE
df.launched.max()

Timestamp('2018-01-02 15:02:31')

- The earliest data --- 1970 doesn't make sense at all. So we filter out all the data that set launched year before the founding of Kickstarter (2009).

- The latest data is in the second day of 2018. That's not enough to have a view for 2018 and might effect to analysis in year level, or month level. So we exclude the incomplete data of 2018.

👉 Do you still remember how to extract datetime components from a date:

```python
# Extract year, month, day
date_series.dt.year
date_series.dt.month
date_series.dt.day

# Extract hour, minute, second
date_series.dt.hour
date_series.dt.minute
date_series.dt.second

# Extract dayofweek, week, quarter
date_series.dt.dayofweek
date_series.dt.isocalendar().week
date_series.dt.quarter

# Extract year-month
date_series.dt.to_period('M')
```

+ Choose to work with data from the beginning of the year 2009 to the end of the year 2017 only.

In [48]:
# Choose to work with data from the beginning of the year 2009 to the end of the year 2017 only.
# FILL-IN THE ___ BELOW:
df2 = df[(df['launched'].dt.year > 2009) & (df['launched'].dt.year < 2018)]

In [49]:
df2.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0


In [50]:
# Check the column one more time
df2['launched'].min()

Timestamp('2010-01-01 01:41:03')

In [51]:
df2['launched'].max()

Timestamp('2017-12-31 23:37:20')

### Final Task

The last task that we should do in cleaning this dataset is to create new columns extract `day`, `month`, `year` from the two columns `launched` and `deadline`. This will help us in the future when we analyse on year, on month, or on day. For example: number of project by year. 

In data analysis, this is a simple feature engineering.

In [52]:
df['launched_day'] =    df['launched'].dt.day
df['launched_month'] =  df['launched'].dt.month
df['launched_year'] =   df.launched.dt.year

In [53]:
# Do the same thing with deadline column
# YOUR CODE HERE
df['deadline_day'] =    df['deadline'].dt.day
df['deadline_month'] =  df['deadline'].dt.month
df['deadline_year'] =   df.deadline.dt.year

In [54]:
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal,launched_day,launched_month,launched_year,deadline_day,deadline_month,deadline_year
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95,11,8,2015,9,10,2015
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0,2,9,2017,1,11,2017
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0,12,1,2013,26,2,2013
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0,17,3,2012,16,4,2012
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0,4,7,2015,29,8,2015


YAYYY!!! WE HAVE DONE EVERTHING. 🤩 Finally you got a clean dataset that is ready for analysis. 

Let's view out beautiful dataset again.

In [55]:
# View our data again
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal,launched_day,launched_month,launched_year,deadline_day,deadline_month,deadline_year
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95,11,8,2015,9,10,2015
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0,2,9,2017,1,11,2017
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0,12,1,2013,26,2,2013
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0,17,3,2012,16,4,2012
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0,4,7,2015,29,8,2015


As you can see, cleaning data is a meticulous process that takes a lot of time. In real life, it can take up to days and requires a lot of domain knowledge. Treat this notebook as a guideline or a case study to start with. Be creative when adapt to your personal project! Good luck 🥸 🧚🏻‍♂️

# B. FUN TASK:

Using IQR to filter the projects that set the goal way too ambitious or too humble.

In [56]:
# Create a new column called 'exceed' which is the difference of 'pledged' and 'goal'
# YOUR CODE HERE
df['exceed'] = df.pledged - df.goal
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal,launched_day,launched_month,launched_year,deadline_day,deadline_month,deadline_year,exceed
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95,11,8,2015,9,10,2015,-1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0,2,9,2017,1,11,2017,-27579.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0,12,1,2013,26,2,2013,-44780.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0,17,3,2012,16,4,2012,-4999.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0,4,7,2015,29,8,2015,-18217.0


In [57]:
# Now apply IQR on the column 'exceed', calculate the upper whisker and lower whisker
# YOUR CODE HERE
q1 = df.exceed.quantile(0.25)
q3 = df.exceed.quantile(0.75)
iqr = q3 - q1
upper = q3 + iqr*1.5
lower = q1 - iqr*1.5

lower, upper

(-25297.5, 15498.5)

Everything above the upper whisker means that the project attracts A LOT of money compared to its original goal. 

In [58]:
# Filter out the projects that above the upper whisker.
# YOUR CODE HERE
df[df.exceed >= upper].head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal,launched_day,launched_month,launched_year,deadline_day,deadline_month,deadline_year,exceed
14,1000057089,Tombstone: Old West tabletop game and miniatur...,Tabletop Games,Games,GBP,2017-05-03,2017-04-05 19:44:18,successful,761,GB,121857.33,6469.73,5,4,2017,3,5,2017,115387.6
27,1000104688,Permaculture Skills,Webseries,Film & Video,CAD,2014-12-14,2014-11-14 18:02:00,successful,571,CA,42174.03,15313.04,14,11,2014,14,12,2014,26860.99
31,1000117861,Ledr workbook: one tough journal!,Product Design,Design,USD,2016-10-08,2016-09-07 13:14:26,successful,549,US,47266.0,1000.0,7,9,2016,8,10,2016,46266.0
46,1000183112,Hot Chicken Takes Over.,Restaurants,Food,USD,2014-10-16,2014-09-16 02:31:08,successful,855,US,63401.0,40000.0,16,9,2014,16,10,2014,23401.0
63,1000235643,HIIT Bottle™,Drinks,Food,USD,2015-04-27,2015-03-13 18:33:08,successful,2784,US,124998.0,15000.0,13,3,2015,27,4,2015,109998.0


In [59]:
df[df.exceed >= upper].shape

(11766, 19)

Everything below the lower whisker means that the project set the goal way TOO HIGH compared to its real potential.

In [61]:
# Filter out the projects that below the lower whisker.
# YOUR CODE HERE
df[df.exceed < lower].sample(5)

Unnamed: 0,ID,name,category,main_category,currency,deadline,launched,state,backers,country,pledged,goal,launched_day,launched_month,launched_year,deadline_day,deadline_month,deadline_year,exceed
52207,1265709349,Absolutism: the Art of Composition and Combina...,Art,Art,USD,2014-01-25,2013-12-26 21:48:07,canceled,4,US,301.0,32000.0,26,12,2013,25,1,2014,-31699.0
33073,1168077047,This Vivid Thing,Narrative Film,Film & Video,GBP,2014-06-21,2014-05-22 22:13:42,failed,7,GB,407.08,102195.5,22,5,2014,21,6,2014,-101788.42
271406,450775778,"Transition Point! Revolution, Evolution or End...",Nonfiction,Publishing,GBP,2014-08-09,2014-07-10 14:52:02,failed,23,GB,4509.65,30876.56,10,7,2014,9,8,2014,-26366.91
219284,2116958480,SC RACER: The small controller for racing games,Hardware,Technology,EUR,2015-10-23,2015-09-23 12:30:09,failed,7,ES,755.93,55420.08,23,9,2015,23,10,2015,-54664.15
272503,456479211,NOT! Your Grandma's Water Cooler,Hardware,Technology,CAD,2015-09-25,2015-08-11 14:37:12,failed,15,CA,3385.41,37557.27,11,8,2015,25,9,2015,-34171.86


In [62]:
df[df.exceed < lower].shape

(48326, 19)

From here, you can further explore the "too successful" projects and the "too failed" projects. For example, what kind of category tend to be too-successful, what opposite? Which country they are from? etc. 

Feel free to code and post your insights to Community. HAVE FUNNN 🧙🏻‍♂️