
# Replacing Data With Map

### Introduction

In the last lesson, we successfully trained our machine learning model, and there was only one feature that we were unable to include: the borough.  In this lesson, we will see how we can use the `map` method to replace the values in this column, a first step toward coercing it into numbers.  In a later lesson, we will see a different approach for categorical variables.

### Initial Loading and Cleaning

Let's take another look at our SAT data from the last lab.

In [0]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/2-coercing-data/nyc_hs_sat.csv?token=ANKFJMC7BYDRMREKZA7QBG26QVGK6"
sat_df = pd.read_csv(url, index_col = 0)

We'll start by dropping the rows that contain missing values.

In [0]:
dropped_sat_df = sat_df.dropna()

Let's confirm that we no longer have `na` values in our data.

In [0]:
dropped_sat_df.isna().sum()

dbn                    0
name                   0
num_test_takers        0
reading_avg            0
math_avg               0
writing_score          0
boro                   0
total_students         0
graduation_rate        0
attendance_rate        0
college_career_rate    0
dtype: int64
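To make the cleaning step concrete, here is a minimal sketch on a small made-up DataFrame. The data and the names `toy_df` and `toy_clean` are assumptions for illustration and are not part of the SAT dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the SAT dataset
toy_df = pd.DataFrame({
    "name": ["School A", "School B", "School C"],
    "math_avg": [404.0, np.nan, 512.0],
    "boro": ["M", "K", None],
})

# Count the missing values in each column before cleaning
missing_before = toy_df.isna().sum()

# dropna() returns a new DataFrame without any row that has a missing value
toy_clean = toy_df.dropna()

# After dropping, every column should report zero missing values
missing_after = toy_clean.isna().sum()

print(missing_before)
print(len(toy_df), "rows before,", len(toy_clean), "rows after")
```

Note that `dropna` leaves the original DataFrame untouched, which is why we assign the result to a new variable.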

Looks pretty good.  Now, as we know, we still cannot use the column `boro` as the values in it are text and not numeric, but perhaps they could be.  Let's tackle that in the next section.  

### Exploring and Mapping

We currently have three columns in our dataset that are non-numeric: `dbn`, `name`, and `boro`.  There is no easy way of representing `dbn` and `name` as meaningful numbers.

In [0]:
dropped_sat_df[['dbn', 'name']][:2]

      dbn                                           name
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL


As we can see, each of the values in those columns is different.  But a little data exploration will reveal that the values in the `boro` column are constrained to five different values, one for each borough of New York City.  A good way to see this is to use the `value_counts` method, which is available on a pandas series.

In [0]:
dropped_sat_df['boro'].value_counts()

K    96
X    80
M    77
Q    60
R    10
Name: boro, dtype: int64
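As a small sketch of what `value_counts` reports, here is the same idea on a made-up series. The values in `toy_boro` are hypothetical. The `normalize=True` argument returns each value's share of the total rather than a raw count.

```python
import pandas as pd

# Hypothetical borough codes, not the real SAT data
toy_boro = pd.Series(["K", "X", "K", "M", "K", "X"], name="boro")

# How many times each distinct value appears, most frequent first
counts = toy_boro.value_counts()

# The same information expressed as a fraction of all entries
shares = toy_boro.value_counts(normalize=True)

# The number of distinct values in the series
n_distinct = toy_boro.nunique()

print(counts)
print(shares)
print(n_distinct)
```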

```
M -> Manhattan
Q -> Queens
X -> Bronx
K -> Brooklyn
R -> Staten Island
```

Now let's change these letters to the corresponding borough name in our dataset.  We can do so with the `map` method.  Here's how:

In [0]:
mapping = {'M': 'Manhattan', 'Q': 'Queens', 'X': 'Bronx', 'K': 'Brooklyn', 'R': 'Staten Island'}
mapped_borough = dropped_sat_df['boro'].map(mapping)

Let's see what this did.

In [0]:
mapped_borough.value_counts()

Brooklyn         96
Bronx            80
Manhattan        77
Queens           60
Staten Island    10
Name: boro, dtype: int64

In [0]:
mapped_borough[:3]

0    Manhattan
1    Manhattan
2    Manhattan
Name: boro, dtype: object

So we can see that we provided `map` a dictionary.  And `map` changed each of the values that matched a key in our dictionary to the corresponding value, here a borough name.
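One detail worth checking is what happens to values that do not match any key: `map` replaces them with `NaN`. Here is a minimal sketch on made-up data. The code `"Z"` and the integer codes in `numbers` are assumptions for illustration only.

```python
import pandas as pd

# "Z" is a made-up code with no entry in the dictionary below
codes = pd.Series(["M", "K", "Z"], name="boro")

names = {"M": "Manhattan", "K": "Brooklyn"}
mapped = codes.map(names)

# Values without a matching key become NaN, so we can find them with isna()
unmatched = codes[mapped.isna()]

# The dictionary values can also be numbers (hypothetical integer codes here)
numbers = {"M": 0, "K": 1}
numeric = codes.map(numbers)

print(mapped)
print(unmatched)
print(numeric)
```

Checking for `NaN` after a mapping is a quick way to catch a key we forgot to include in the dictionary.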

### Going further with map

So far, we have used `map` with a dictionary to change one value to another.

In [0]:
mapping = {'M': 'Manhattan', 'Q': 'Queens', 'X': 'Bronx', 'K': 'Brooklyn', 'R': 'Staten Island'}
mapped_borough = dropped_sat_df['boro'].map(mapping)
mapped_borough.value_counts()

Brooklyn         96
Bronx            80
Manhattan        77
Queens           60
Staten Island    10
Name: boro, dtype: int64

But another way that we can use `map` is to pass it a function.  Let's see this with some datetime data.  Let's load up our Max's Wine Bar data, and get started.

In [0]:
import pandas as pd

url = "https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/2-coercing-data/max-revenue.json?token=ANKFJMBT7AUKZPSIA47CLTC6QVGHG"
df_max = pd.read_json(url)
df_max[:3]

                  end_date  total_receipts
0  2016-12-31T00:00:00.000           56182
1  2017-08-31T00:00:00.000            9400
2  2016-06-30T00:00:00.000           50574


Because `end_date` is currently of type object, let's convert it to be a `datetime`.

In [0]:
end_date_dt = df_max['end_date'].astype('datetime64[ns]')
end_date_dt[:2]

0   2016-12-31
1   2017-08-31
Name: end_date, dtype: datetime64[ns]
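As an alternative sketch, pandas also provides `pd.to_datetime` for parsing strings into datetimes. The series below is made up to mirror the format of `end_date`, and the names `raw_dates` and `converted` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical strings in the same format as the end_date column
raw_dates = pd.Series(
    ["2016-12-31T00:00:00.000", "2017-08-31T00:00:00.000"],
    name="end_date",
)

# Parse the strings into pandas datetime values
converted = pd.to_datetime(raw_dates)

# Each entry is now a Timestamp rather than a string
first_entry = converted[0]

print(converted)
print(type(first_entry))
```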

Ok, now that `end_date_dt` is of type `datetime`, we can access the attributes `month` and `year` on individual entries.

In [0]:
end_date_dt[0].month

12

But we would like to create an `end_date_month` series where we extract the month from every entry.  We can do so with the `map` method, like so.

In [0]:
end_date_month = end_date_dt.map(lambda date: date.month)
end_date_month[:2]

0    12
1     8
Name: end_date, dtype: int64

So here, `map` works by looping through the data, and each entry takes a turn being represented by the `date` variable.  After the colon, we indicate what information we would like to replace the date entry with, here the date's month.
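For datetime series specifically, pandas also offers the `.dt` accessor, which extracts the same information without a lambda. Here is a short sketch comparing the two on made-up dates. The names `dates`, `via_map` and `via_dt` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical dates standing in for the end_date column
dates = pd.to_datetime(
    pd.Series(["2016-12-31", "2017-08-31", "2016-06-30"], name="end_date")
)

# map calls the lambda once per entry
via_map = dates.map(lambda date: date.month)

# The .dt accessor extracts the month from the whole series at once
via_dt = dates.dt.month

print(via_map.tolist())
print(via_dt.tolist())
```

Both approaches give the same months here. The `map` version is more general, since the lambda can return anything we compute from each entry.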

If we would prefer, we can also break this into steps by first writing a function called `date_to_month`.

In [0]:
def date_to_month(date):
    return date.month

In [0]:
end_date_dt[0]
# Timestamp('2016-12-31 00:00:00')

date_to_month(end_date_dt[0])

12

In [0]:
months = end_date_dt.map(lambda date: date_to_month(date))
months[:2]

0    12
1     8
Name: end_date, dtype: int64
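Because `map` accepts any function of one argument, a named function can also be passed directly, without wrapping it in a lambda. Here is a minimal sketch on made-up dates. The names `dates`, `wrapped` and `direct` are assumptions for illustration.

```python
import pandas as pd


def date_to_month(date):
    """Return the month number of a single Timestamp."""
    return date.month


# Hypothetical dates standing in for the end_date column
dates = pd.to_datetime(pd.Series(["2016-12-31", "2017-08-31"], name="end_date"))

# Wrapping the function in a lambda
wrapped = dates.map(lambda date: date_to_month(date))

# Passing the function itself: map calls it once for each entry
direct = dates.map(date_to_month)

print(wrapped.tolist())
print(direct.tolist())
```

Notice that we pass `date_to_month` without parentheses, since we are handing `map` the function rather than calling it ourselves.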

Try to write the corresponding code to extract the date's year.  Remember we can go from one date to a year with the `year` attribute.

In [0]:
end_date_year = end_date_dt
end_date_year[:2]

# 0    2016
# 1    2017
# Name: end_date, dtype: int64

0   2016-12-31
1   2017-08-31
Name: end_date, dtype: datetime64[ns]
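If you get stuck, here is one possible sketch of a solution, shown on made-up dates rather than on `end_date_dt` itself. The names `dates` and `years` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical dates standing in for the end_date column
dates = pd.to_datetime(pd.Series(["2016-12-31", "2017-08-31"], name="end_date"))

# Replace each date with its year, following the same pattern used for the month
years = dates.map(lambda date: date.year)

print(years.tolist())
```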

So the general pattern for `map` with a function is `series.map(lambda variable: expression)`, where the expression says what each entry should be replaced with.

### Summary

In this lesson, we saw how to alter our data with the `map` method.  We saw that this can convert matching strings to other values.  We saw that `map` accepts either a dictionary, which changes each value that matches a key to the corresponding dictionary value, or a function such as a `lambda`, which replaces each entry with the value the function returns.