# Replacing Data: Missing Data And Map

### Introduction

In our last lab, we gathered data from our CSV file and coerced much of it into numbers so that we could ultimately use it to train a machine learning model.  There were a couple of places where we got stuck.  In this lesson, we'll learn how to finish cleaning our data by working with missing values and with the map method.

### Our SAT Data - Not as Clean as We Thought :(

Let's take another look at our SAT data from the last lab.

In [2]:
import pandas as pd
sat_df = pd.read_csv('./nyc_hs_sat.csv', index_col = 0)

In [3]:
sat_df.dtypes

dbn                     object
name                    object
num_test_takers        float64
reading_avg            float64
math_avg               float64
writing_score          float64
boro                    object
total_students           int64
graduation_rate        float64
attendance_rate        float64
college_career_rate    float64
dtype: object

Looking at the above data, it appears that we have a good number of features to predict our target of `math_avg` for a school, and we would like to begin training our model.  Let's try it.  We select our target column and our feature columns.

In [8]:
y = sat_df.math_avg

In [9]:
X = sat_df.select_dtypes(exclude = ['object']).drop(columns = ['math_avg'])
X.columns

Index(['num_test_takers', 'reading_avg', 'writing_score', 'total_students',
       'graduation_rate', 'attendance_rate', 'college_career_rate'],
      dtype='object')

Ok, so we have our target `y` assigned to be our `math_avg`, and as features we have the columns listed above, all of which are numeric.  If we try to fit our model, we will get the following error.

In [12]:
# from sklearn.tree import DecisionTreeRegressor

# model = DecisionTreeRegressor()
# model.fit(X, y)

**Input contains NaN, infinity or a value too large for dtype('float32')**

The problem is that our dataset has missing values, and a missing value does not count as a number that the model can train on.  Let's identify where our missing values are located, and then we can discuss how to deal with them.
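It may seem odd that columns with missing values passed our numeric filter.  The reason is that pandas stores a missing number as a special floating point value, `NaN`, so the column keeps a `float64` dtype.  Here is a minimal sketch using a made-up series (the scores below are invented for illustration, not taken from the SAT file).

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only; the middle one is missing
scores = pd.Series([410.0, np.nan, 385.0])

# The column still reports a numeric dtype, even with a missing value
print(scores.dtype)

# isna() flags which entries are actually missing
missing_flags = scores.isna().tolist()
print(missing_flags)
```

So checking `dtypes` alone is not enough to tell us whether a column is ready for training.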

### Working with Missing Values

In pandas, missing values (if we're lucky) are generally marked with a special `na` value, which stands for not available and is displayed as `NaN`.  We can identify the number of missing values in each column with the following line of code.

In [13]:
sat_df.isna().sum()

dbn                     0
name                    0
num_test_takers        29
reading_avg            29
math_avg               29
writing_score          29
boro                    0
total_students          0
graduation_rate         5
attendance_rate         0
college_career_rate     5
dtype: int64
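To see why chaining `isna()` and `sum()` produces these counts, here is a minimal sketch on a small made-up dataframe.  The name `toy_df` and its values are assumptions for illustration, not part of the SAT data.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not taken from the SAT file)
toy_df = pd.DataFrame({
    'name': ['School A', 'School B', 'School C'],
    'math_avg': [410.0, np.nan, 385.0],
    'graduation_rate': [0.81, np.nan, np.nan],
})

# isna() swaps each cell for True (missing) or False (present)
missing_mask = toy_df.isna()

# sum() treats True as 1 and False as 0, giving one count per column
missing_counts = missing_mask.sum()
print(missing_counts)
```

Each count is simply the number of `True` values in that column of the mask.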

In the SAT counts above, we can see that a number of columns have missing values: four columns are each missing `29` values, and two more are each missing `5`.  We will have to eliminate this `na` data before training our model.  What to do with missing values warrants a longer discussion, but for now, we can simply drop the rows that contain missing values.  Here's how.

In [14]:
dropped_sat_df = sat_df.dropna()

> The method `dropna` returns a new, updated dataframe, so be sure to store this new dataframe in a variable.
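As a quick check of that note, here is a minimal sketch with a made-up dataframe (the names `toy_df` and `cleaned_df` and their values are assumptions for illustration).  It shows that `dropna` leaves the original dataframe untouched.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not taken from the SAT file)
toy_df = pd.DataFrame({
    'math_avg': [410.0, np.nan, 385.0],
    'graduation_rate': [0.81, 0.77, np.nan],
})

# By default, dropna() removes every row that has at least one missing value
cleaned_df = toy_df.dropna()

# The original dataframe keeps all of its rows, missing values included
print(len(toy_df))

# The new dataframe holds only the rows with no missing values
print(len(cleaned_df))
```

If we had called `toy_df.dropna()` without storing the result, `toy_df` would still contain its missing values.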

And now we can see that none of the columns have `na` values.

In [15]:
dropped_sat_df.isna().sum()

dbn                    0
name                   0
num_test_takers        0
reading_avg            0
math_avg               0
writing_score          0
boro                   0
total_students         0
graduation_rate        0
attendance_rate        0
college_career_rate    0
dtype: int64

And now we can successfully train our model.

In [16]:
X = dropped_sat_df.select_dtypes(exclude = ['object']).drop(columns = ['math_avg'])
X.columns

y = dropped_sat_df.math_avg

In [17]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

Woohoo!  If no one is looking, do a little shake.

### Summary

In this lesson, we saw that requiring all of our model's training data to be numeric also means that the data must not contain any `na` values.  We can discover how many `na` values are in each column with the line:

```python
df.isna().sum()
```

And we can drop the rows that have missing data in any column with the line:

`dropped_df = df.dropna()`