
# Explore and Map Lab

### Introduction

In the last lesson, we saw some methods for identifying data that we can coerce into numbers.  Let's get some practice with identifying and coercing data in this lab.

## Loading and Exploring our Data

Let's begin by loading our data from the following URL.

In [0]:
url = 'https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/2-coercing-data/semi-cleaned-imdb.csv?token=ANKFJMA5DF6AJEZ6CKO4C6C6QVGIE'

Import the pandas library and read the CSV file, storing the data in a dataframe named `df`.

In [0]:
import pandas as pd
df = None
df[:2]
# 	title	budget	runtime	original_language	release_date	revenue	genre
# 0	Avatar	237000000	162.0	en	2009-12-10	2787965087	Action
# 1	Pirates of the Caribbean: At World's End	300000000	169.0	en	2007-05-19	961000000	Adventure
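If the loading step is unfamiliar, here is a minimal sketch of `pd.read_csv`.  It assumes a small inline CSV (`sample_csv`, a hypothetical stand-in built from the two rows shown above) so that it runs without network access; `pd.read_csv` accepts a URL string in the same position.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: just the two rows shown above.
sample_csv = io.StringIO(
    "title,budget,runtime,original_language,release_date,revenue,genre\n"
    "Avatar,237000000,162.0,en,2009-12-10,2787965087,Action\n"
    "Pirates of the Caribbean: At World's End,300000000,169.0,en,2007-05-19,961000000,Adventure\n"
)

# read_csv takes a file-like object here; a URL string works the same way.
sample_df = pd.read_csv(sample_csv)
sample_df[:2]
```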

Ok, now it's time to explore some of our data.  We want to identify the datatype of each of the columns.

In [0]:
None

# title                 object
# budget                 int64
# runtime              float64
# original_language     object
# release_date          object
# revenue                int64
# genre                 object
# dtype: object
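As a rough illustration of how pandas reports datatypes, the sketch below reads the `dtypes` attribute of a hypothetical `toy_df` that holds a few columns from the rows shown earlier.  The names `toy_df` and `toy_dtypes` are ours, not part of the lab.

```python
import pandas as pd

# Hypothetical miniature of the lab's dataframe.
toy_df = pd.DataFrame({
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "budget": [237000000, 300000000],
    "runtime": [162.0, 169.0],
})

# dtypes is an attribute (no parentheses) that returns one dtype per column.
toy_dtypes = toy_df.dtypes
print(toy_dtypes)
```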

So we can see that a number of the columns are objects.  Let's select just the columns that are of type `object` from the dataframe.

In [0]:
object_df = None
object_df[:2]

# title	original_language	release_date	genre
# 0	Avatar	en	2009-12-10	Action
# 1	Pirates of the Caribbean: At World's End	en	2007-05-19	Adventure


Now let's just view the columns in the `object_df`.

In [0]:
object_cols = None
object_cols
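For reference, here is a small sketch of `select_dtypes` followed by `.columns` on a hypothetical `toy_df`.  The text columns are given `dtype=object` explicitly so the example does not depend on how pandas infers string columns.

```python
import pandas as pd

# Hypothetical miniature dataframe; dtype=object is spelled out for the text columns.
toy_df = pd.DataFrame({
    "title": pd.Series(["Avatar", "Pirates of the Caribbean: At World's End"], dtype=object),
    "genre": pd.Series(["Action", "Adventure"], dtype=object),
    "runtime": [162.0, 169.0],
})

# Keep only the object columns, then read off their names.
toy_object_df = toy_df.select_dtypes("object")
toy_object_cols = toy_object_df.columns
list(toy_object_cols)
```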

Ok, so `title` will not be used as a feature in our model, but potentially the other columns of `original_language`, `release_date` and `genre` can be used.  Let's focus on `original_language`, and keep going from there.

To see how easy it would be to convert `original_language`, let's look at the values it contains.

In [0]:
df.original_language.value_counts()

en    1953
fr      13
zh      11
ru       5
ja       4
es       4
de       3
ko       2
cn       2
it       1
hi       1
te       1
Name: original_language, dtype: int64

### Coercing Data

Ok, so it looks like we could change this to a column of `in_english`, storing True if the movie is in English and False otherwise.  Let's use the `map` method, starting by passing it a dictionary.

In [0]:
mapping = None

In [0]:
lan_bool = None
lan_bool.value_counts()

# True     1953
# False      47
# Name: original_language, dtype: int64
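One thing to keep in mind when passing a dictionary to `map`: any value without a matching key comes back as `NaN`.  The sketch below shows this on a hypothetical `toy_languages` series, and one possible way (not necessarily the intended one) to build a dictionary that covers every value.

```python
import pandas as pd

toy_languages = pd.Series(["en", "fr", "en", "zh"], name="original_language")

# A dictionary with only one key leaves the other values as NaN.
partial = toy_languages.map({"en": True})

# Building the dictionary from the unique values gives every value a key.
toy_mapping = {lang: lang == "en" for lang in toy_languages.unique()}
toy_bool = toy_languages.map(toy_mapping)
toy_bool.value_counts()
```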

Now let's use `map` with a function to change `en` to True, and every other value to False.

In [0]:
def lan_to_bool(language):
    pass

In [0]:
lan_bool = None

In [0]:
lan_bool.value_counts()
# True     1953
# False      47
# Name: original_language, dtype: int64
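To illustrate the pattern of passing a function to `map`, here is a sketch on a hypothetical `toy_languages` series.  The helper `is_english` is our own name, chosen so it does not clash with the `lan_to_bool` stub above.

```python
import pandas as pd


def is_english(language):
    """Return True for the language code 'en', False for anything else."""
    return language == "en"


toy_languages = pd.Series(["en", "fr", "en", "zh"], name="original_language")

# map calls is_english once per entry and collects the results in a new series.
toy_flags = toy_languages.map(is_english)
toy_flags.value_counts()
```

Unlike the dictionary version, a function handles values it has never seen before, so there is no risk of `NaN` from a missing key.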

Ok, now that we have our data in the format we want, let's copy our original dataframe, and add a column for `in_english`.  We'll do this for you.

In [0]:
df_updated = df.copy()

df_updated['in_english'] = lan_bool

In [0]:
df_updated[:2]

| | title | budget | runtime | original_language | release_date | revenue | genre | in_english |
|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | 237000000 | 162.0 | en | 2009-12-10 | 2787965087 | Action | True |
| 1 | Pirates of the Caribbean: At World's End | 300000000 | 169.0 | en | 2007-05-19 | 961000000 | Adventure | True |


Next, drop the `original_language` column.

In [0]:
df_with_en_col = df_updated.drop(columns= ['original_language'])

'original_language' in df_with_en_col.columns

# False
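The copy, add and drop steps above can be sketched on a hypothetical two-row `toy_df`.  The point to notice is that `copy` and `drop` each return a new dataframe, so the earlier dataframes are left unchanged.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "original_language": ["en", "en"],
})

# Work on a copy so that toy_df itself never gains the new column.
toy_updated = toy_df.copy()
toy_updated["in_english"] = toy_updated["original_language"].map(lambda lang: lang == "en")

# drop returns a new dataframe by default; toy_updated keeps original_language.
toy_dropped = toy_updated.drop(columns=["original_language"])
"original_language" in toy_dropped.columns
```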

In [0]:
df_with_en_col.select_dtypes('object')[:3]

| | title | release_date | genre |
|---|---|---|---|
| 0 | Avatar | 2009-12-10 | Action |
| 1 | Pirates of the Caribbean: At World's End | 2007-05-19 | Adventure |
| 2 | Spectre | 2015-10-26 | Action |


Of the remaining object columns, `release_date` is the next one to clean up.  It's currently of type object, but we can change that.

### Working with DateTimes

Yes, we've never worked with datetimes before, but that doesn't mean we can't try it in a lab.  Don't worry, we'll provide some help.  Take a look at the first value of `release_date`.

In [0]:
df.release_date[0]

'2009-12-10'

It's a simple string -- which makes sense, considering the series is of type object.  Now to feed this into our model, we could perhaps create columns for the release month and the release year.  To do this, we first want to change our series from type object to type `datetime`.  This is easy enough.

We can use the `astype` method to do this.

In [0]:
df.release_date.astype('datetime64[ns]')

0      2009-12-10
1      2007-05-19
2      2015-10-26
3      2012-07-16
4      2012-03-07
          ...    
1995   2000-02-18
1996   2001-03-30
1997   2013-12-18
1998   2001-10-05
1999   2013-12-05
Name: release_date, Length: 2000, dtype: datetime64[ns]

Or even easier, we can use the `pd.to_datetime` function.

In [0]:
release_date_dt = pd.to_datetime(df.release_date)
release_date_dt

0      2009-12-10
1      2007-05-19
2      2015-10-26
3      2012-07-16
4      2012-03-07
          ...    
1995   2000-02-18
1996   2001-03-30
1997   2013-12-18
1998   2001-10-05
1999   2013-12-05
Name: release_date, Length: 2000, dtype: datetime64[ns]
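As a quick check that the two approaches agree, here is a sketch on a hypothetical `toy_dates` series holding the first three dates from the output above.

```python
import pandas as pd

toy_dates = pd.Series(["2009-12-10", "2007-05-19", "2015-10-26"], name="release_date")

# Two routes from strings to datetimes.
via_astype = toy_dates.astype("datetime64[ns]")
via_to_datetime = pd.to_datetime(toy_dates)

# Compare value by value rather than assuming identical dtypes.
same_values = (via_astype == via_to_datetime).all()
same_values
```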

Ok, great.  Now that each of our values is a datetime, there are various attributes we can use to extract data from our datetimes.

For example, this is our first datetime.

In [0]:
first_dt = release_date_dt[0]
first_dt

Timestamp('2009-12-10 00:00:00')

And now we can get the month or year from that datetime.

In [0]:
first_dt.month

12

In [0]:
first_dt.year

2009
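The same attributes can be tried on a standalone timestamp.  This sketch builds a hypothetical `toy_ts` directly with `pd.Timestamp`, rather than pulling it from the series.

```python
import pandas as pd

# A single Timestamp built from the same date string as the first row.
toy_ts = pd.Timestamp("2009-12-10")

# month and year are attributes, so there are no parentheses.
toy_month = toy_ts.month
toy_year = toy_ts.year
(toy_month, toy_year)
```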

Now of course what we would like to do is just easily convert each of our values to a month and year, with something like:

In [0]:
# release_date_dt.month

But doing so will raise an `AttributeError`, because the series itself has no `month` attribute.  So instead, we should go through each entry in our series and read `month` or `year` from the datetime entry.  Sounds like a job for `map`.  We'll do the `month` conversion for you, showing how to use `map` with a lambda instead of a separately defined function.

In [0]:
release_month = release_date_dt.map(lambda release_date: release_date.month)
release_month[:3]

0    12
1     5
2    10
Name: release_date, dtype: int64
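For comparison, pandas also provides a `.dt` accessor on datetime series.  The lab itself does not use it, so treat this as an aside.  The sketch below, on a hypothetical `toy_dt` series, shows that it yields the same months as the `map` approach.

```python
import pandas as pd

toy_dt = pd.to_datetime(pd.Series(["2009-12-10", "2007-05-19", "2015-10-26"]))

# Entry-by-entry, as in the lab.
month_via_map = toy_dt.map(lambda ts: ts.month)

# Vectorized alternative through the .dt accessor.
month_via_dt = toy_dt.dt.month

month_via_map.tolist() == month_via_dt.tolist()
```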

Ok, now convert our values in `release_date_dt` to the corresponding year.  Assign the series to the variable `release_year`.

In [0]:
release_year = None

Now let's add these new columns to our dataframe.

> First, we'll look again at our current columns.

In [0]:
df_with_en_col.columns

Index(['title', 'budget', 'runtime', 'release_date', 'revenue', 'genre',
       'in_english'],
      dtype='object')

So we want to copy our old dataframe, remove the `release_date` column and add columns for `release_year` and `release_month`.  Let's do it.

In [0]:
df_with_en_col_copy = df_with_en_col.copy()
df_release = df_with_en_col_copy.drop(columns = ['release_date'])

In [0]:
df_release['release_year'] = release_year

Now assign the column `release_month` to the `release_month` series.

In [0]:
df_release['release_month'] = release_month

Finally, let's take a look at our new dataframe's columns.

In [0]:
df_release.columns

Index(['title', 'budget', 'runtime', 'revenue', 'genre', 'in_english',
       'release_year', 'release_month'],
      dtype='object')

And let's select just the columns that are of type object.

In [0]:
df_release.select_dtypes('object')[:3]

| | title | genre |
|---|---|---|
| 0 | Avatar | Action |
| 1 | Pirates of the Caribbean: At World's End | Adventure |
| 2 | Spectre | Action |


And if we look at all of the other columns, we see that the rest of our data is numeric or boolean.

In [0]:
df_release.select_dtypes(exclude = 'object')[:2]

| | budget | runtime | revenue | in_english | release_year | release_month |
|---|---|---|---|---|---|---|
| 0 | 237000000 | 162.0 | 2787965087 | True | 2009 | 12 |
| 1 | 300000000 | 169.0 | 961000000 | True | 2007 | 5 |
