# Coercing Data

### Introduction

In this lab, we'll work on coercing data from a dataframe.

### Exploring our data

We can start by loading our data from the related `csv` file.

In [123]:
df = pd.read_csv('./nyc_hs_sat.csv', index_col = 0)

columns = ['reading_avg', 'math_avg', 'writing_score']
df[columns] = df[columns].astype('object')

> Press shift + enter on the line above

We use `index_col = 0` to specify that the first column in the csv file will be the index.

> Also we changed some of the columns to be an object so that our data is not so clean right out of the gate.  But let's pretend that didn't happen.

Ok, now let's begin to explore our data.

1. First, view the list of columns in our dataframe.

In [72]:
df_columns = None
df_columns
# Index(['dbn', 'name', 'num_test_takers', 'reading_avg', 'math_avg',
#        'writing_score', 'boro', 'total_students', 'graduation_rate',
#        'attendance_rate', 'college_career_rate'],
#       dtype='object')

2. Now, let's see the columns and the respective datatype of each column.

In [74]:
df_types = None
df.dtypes

# dbn                     object
# name                    object
# num_test_takers        float64
# reading_avg             object
# math_avg                object
# writing_score           object
# boro                    object
# total_students           int64
# graduation_rate        float64
# attendance_rate        float64
# college_career_rate    float64
# dtype: object

dbn                     object
name                    object
num_test_takers        float64
reading_avg             object
math_avg                object
writing_score           object
boro                    object
total_students           int64
graduation_rate        float64
attendance_rate        float64
college_career_rate    float64
dtype: object

Now if our datatypes are already a number, we're in pretty good shape.  It's where our datatypes are objects that we should do some cleaning up.  Save a dataframe of just the columns that are of type object as `sat_objects_df`.

In [78]:
sat_objects_df = df # change this code here
sat_objects_df.columns

# Index(['dbn', 'name', 'reading_avg', 'math_avg', 'writing_score', 'boro'], dtype='object')

Index(['dbn', 'name', 'num_test_takers', 'reading_avg', 'math_avg',
       'writing_score', 'boro', 'total_students', 'graduation_rate',
       'attendance_rate', 'college_career_rate'],
      dtype='object')

Ok, so we can see that wew have six columns that are of type string.  The first two columns of dbn and name are probably fine as columns.  It's the other ones that we try to change the type of.

Use the `to_numeric` method to change the `num_test_takers` to an `int64` and replace the appropriate data in our `df` dataframe.

In [1]:
# write some code here

In [None]:
df['math_avg'].dtype

# dtype('float64')

Now change the `reading_avg` column to numeric using the `to_numeric` method.

In [2]:
# write code hre

In [94]:

df['reading_avg'].dtype

# dtype('float64')

dtype('float64')

Let's do the same thing with writing score.

In [3]:
# change the writing score to numeric here

Now take another look at all of the datatypes in our dataframe.

In [101]:
df.dtypes

# dbn                     object
# name                    object
# num_test_takers        float64
# reading_avg            float64
# math_avg               float64
# writing_score          float64
# boro                    object
# total_students           int64
# graduation_rate        float64
# attendance_rate        float64
# college_career_rate    float64
# dtype: object

dbn                     object
name                    object
num_test_takers        float64
reading_avg            float64
math_avg               float64
writing_score          float64
boro                    object
total_students           int64
graduation_rate        float64
attendance_rate        float64
college_career_rate    float64
dtype: object

A lot more of them seem to be an integer or float.  Once again store a dataframe of the columns that are of type `object`. 

In [103]:
object_df = None
object_df.columns

# Index(['dbn', 'name', 'boro'], dtype='object')

Index(['dbn', 'name', 'boro'], dtype='object')

So were down to `boro` as the column that we should change.

### Changing the boro column

To start with, let's see all the numbers of each of the different values in the `boro` column.

In [4]:
# Use the value_counts method

So we can see that there are five different values, just as there are five boroughs.  A little research would reveal that `M -> Manhattan`, `K -> Brooklyn`, `X -> Bronx`, `Q -Queens`, `R -> Staten Island`.  These can be a bit confusing so convert each of them values in the column to it's related string.

In [124]:
# hint: think about how we coerced previous data into boolean data



# Brooklyn         104
# Manhattan         90
# Bronx             87
# Queens            65
# Staten Island     10
# Name: boro, dtype: int64

Brooklyn         104
Manhattan         90
Bronx             87
Queens            65
Staten Island     10
Name: boro, dtype: int64

Ok, now we don't have any cryptic values.  But we want to change this text to numbers. Let's start by changing it from type object to type category.

In [125]:
df['boro'].dtype

dtype('O')

Next, let's change the column to be of type category.

In [6]:
# Write code here


df['boro'].dtype

# CategoricalDtype(categories=['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'], ordered=False)

Ok, next up is to replace each `text` category, currently the borough names, with a respective category number.  

In [129]:
# Write some code here


df['boro'].value_counts()


# 1    104
# 2     90
# 0     87
# 3     65
# 4     10
# Name: boro, dtype: int64

1    104
2     90
0     87
3     65
4     10
Name: boro, dtype: int64

Ok, great now, we can see that almost all of our data in the dataframe is a number of some sort, which is what we want.