
# Explore and Coerce Pandas DataTypes 

### Introduction

In the last lesson we saw that even though some of our columns may start off as datatype `object`, and thus are not directly usable in a machine learning model, we can develop techniques for extracting numeric information from that data.  In this lab we'll practice exploring the datatypes of our pandas dataframe and coercing some of our columns to be numeric.

### Loading and Exploring Our Data

Let's start by loading our data.

In [0]:
import pandas as pd
url = 'https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/nyc_hs_sat.csv'
df = pd.read_csv(url, index_col = 0)

# to make things more interesting, we also alter some of the data
columns = ['math_avg', 'writing_score']
df[columns] = df[columns].astype('object')
str_cols = df[columns].apply(lambda x: x.map(str))
df = df.drop(columns = columns)
sat_df = pd.concat([df, str_cols], axis = 1)

> Press shift + enter to load the data above

In [0]:
sat_df[:2]

| | dbn | name | num_test_takers | reading_avg | boro | total_students | graduation_rate | attendance_rate | college_career_rate | math_avg | writing_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29.0 | 355.0 | M | 171 | 0.66 | 0.87 | 0.36 | 404.0 | 363.0 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91.0 | 383.0 | M | 465 | 0.9 | 0.93 | 0.7 | 423.0 | 366.0 |


Now let's begin exploring our data.  Let's start by looking at all of the datatypes for each column.

In [0]:
sat_df_datatypes = None
sat_df_datatypes

# dbn                     object
# name                    object
# num_test_takers        float64
# reading_avg            float64
# boro                    object
# total_students           int64
# graduation_rate        float64
# attendance_rate        float64
# college_career_rate    float64
# math_avg                object
# writing_score           object
# dtype: object
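As a rough illustration on a small made-up dataframe (the name `toy_df` and its values are assumptions, not part of the SAT data), the `dtypes` attribute returns a series that maps each column name to its datatype:

```python
import pandas as pd

# Hypothetical toy data (not the SAT dataset). dtype='object' is set
# explicitly so the string columns behave the same across pandas versions.
toy_df = pd.DataFrame({
    'school': pd.Series(['A', 'B'], dtype='object'),
    'students': [171, 465],
    'grad_rate': [0.66, 0.90],
    'math_avg': pd.Series(['404.0', '423.0'], dtype='object'),
})

# dtypes is an attribute (no parentheses) holding one dtype per column
toy_dtypes = toy_df.dtypes
print(toy_dtypes)
```

Note that `math_avg` shows up as `object` here even though its values look numeric, because they are stored as strings.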

We can see that there are multiple columns of type `object` which we can potentially clean.  Select just those columns that are of type `object` and assign the resulting dataframe to the variable `object_df`.

In [0]:
object_df = None

object_df.columns

# Index(['dbn', 'name', 'boro', 'math_avg', 'writing_score'], dtype='object')
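One possible way to filter columns by datatype is the dataframe's `select_dtypes` method. The sketch below uses a made-up dataframe (`toy_df` and `toy_object_df` are hypothetical names), so treat it as an illustration of the technique rather than the lab's answer:

```python
import pandas as pd

# Hypothetical toy data (not the SAT dataset)
toy_df = pd.DataFrame({
    'school': pd.Series(['A', 'B'], dtype='object'),
    'students': [171, 465],
    'math_avg': pd.Series(['404.0', '423.0'], dtype='object'),
})

# Keep only the columns whose dtype is object; the result is a new dataframe
toy_object_df = toy_df.select_dtypes(include=['object'])
print(toy_object_df.columns)
```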

Ok, so it looks like `math_avg` and `writing_score` are two columns that could be converted to become numeric.

### Coercing our data

Let's take a look at the first entry in the `writing_score` column.

In [0]:
sat_df.writing_score[0]

'363.0'
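The quotes around the value are the hint that it is stored as a string rather than a number. A small sketch of how one might confirm this, using a made-up series (`toy_scores` is a hypothetical name):

```python
import pandas as pd

# Hypothetical toy series of numbers stored as strings
toy_scores = pd.Series(['363.0', '366.0'], dtype='object')

# Pull out the first entry by position and inspect its Python type
first_entry = toy_scores.iloc[0]
print(repr(first_entry))
print(type(first_entry))
```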

Let's convert this column to be of type `float64` and assign this series to the variable `writing` using pandas' `to_numeric` function.

In [0]:
writing = pd.to_numeric(sat_df.writing_score)

writing.dtype
# dtype('float64')

dtype('float64')
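For reference, here is a minimal sketch of `pd.to_numeric` on a made-up series. The series name and the `'missing'` placeholder value are assumptions for illustration; the `errors='coerce'` argument tells pandas to turn unparseable entries into `NaN` instead of raising an error:

```python
import pandas as pd

# Hypothetical toy series: two numeric strings and one unparseable value
toy_scores = pd.Series(['363.0', '366.0', 'missing'], dtype='object')

# errors='coerce' replaces values that cannot be parsed with NaN
toy_numeric = pd.to_numeric(toy_scores, errors='coerce')
print(toy_numeric.dtype)
print(toy_numeric)
```

With the default `errors='raise'`, the same call would raise a `ValueError` on the `'missing'` entry.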

Now let's do the same with the `math_avg` column.  This time, do not use the `to_numeric` function to coerce the data, but use the `astype` method to change the datatype from `object` to `float64`.

In [0]:
math = sat_df.math_avg.astype('float64')

math.dtype
# dtype('float64')

dtype('float64')
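A short sketch of how `astype` behaves, using made-up series (`clean`, `dirty`, and the `'missing'` value are hypothetical). It suggests that `astype('float64')` works when every string parses as a number, but raises a `ValueError` when one does not:

```python
import pandas as pd

# Hypothetical toy series where every string is a valid number
clean = pd.Series(['404.0', '423.0'], dtype='object')
clean_floats = clean.astype('float64')
print(clean_floats.dtype)

# Hypothetical toy series with one value that is not a number
dirty = pd.Series(['404.0', 'missing'], dtype='object')
try:
    dirty.astype('float64')
    astype_failed = False
except ValueError as error:
    # astype has no option to skip bad values, so the conversion fails
    astype_failed = True
    print('astype failed:', error)
```

This strictness can be useful: a failed conversion signals that the column needs more cleaning before it is treated as numeric.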

Ok, now that we have two coerced columns stored as `math` and `writing`, it's time to update our dataframe.  We'll copy the dataframe for you, so that we do not change the original.  Then, let's update this *copied* dataframe to have the new `float64` type columns.

In [0]:
copied_sat_df = sat_df.copy()

In [0]:
copied_sat_df['math_avg'] = math

In [0]:
copied_sat_df['writing_score'] = writing

In [0]:
copied_sat_df.dtypes

# dbn                     object
# name                    object
# num_test_takers        float64
# reading_avg            float64
# boro                    object
# total_students           int64
# graduation_rate        float64
# attendance_rate        float64
# college_career_rate    float64
# math_avg               float64
# writing_score          float64

dbn                     object
name                    object
num_test_takers        float64
reading_avg            float64
boro                    object
total_students           int64
graduation_rate        float64
attendance_rate        float64
college_career_rate    float64
math_avg               float64
writing_score          float64
dtype: object
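To see why working on a copy matters, here is a small sketch with a made-up dataframe (`toy_df` and `toy_copy` are hypothetical names). Assigning a coerced column to the copy leaves the original untouched:

```python
import pandas as pd

# Hypothetical toy dataframe with numbers stored as strings
toy_df = pd.DataFrame({
    'math_avg': pd.Series(['404.0', '423.0'], dtype='object'),
})

# copy() returns an independent dataframe (a deep copy by default)
toy_copy = toy_df.copy()

# Overwrite the column on the copy with its float64 version
toy_copy['math_avg'] = toy_copy['math_avg'].astype('float64')

print(toy_df.dtypes)    # original column is still object
print(toy_copy.dtypes)  # copied column is now float64
```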

So we now have a lot of numeric data that we can use in our model.

In [0]:
copied_sat_df.select_dtypes(exclude = ['object'])[:2]

| | num_test_takers | reading_avg | total_students | graduation_rate | attendance_rate | college_career_rate | math_avg | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 29.0 | 355.0 | 171 | 0.66 | 0.87 | 0.36 | 404.0 | 363.0 |
| 1 | 91.0 | 383.0 | 465 | 0.9 | 0.93 | 0.7 | 423.0 | 366.0 |
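As an aside, `select_dtypes` can also name the types to keep rather than the types to drop. The sketch below, on a made-up dataframe (all names are hypothetical), suggests that `include='number'` selects the same columns as `exclude=['object']` when the only non-numeric columns are of type `object`:

```python
import pandas as pd

# Hypothetical toy data (not the SAT dataset)
toy_df = pd.DataFrame({
    'school': pd.Series(['A', 'B'], dtype='object'),
    'students': [171, 465],
    'grad_rate': [0.66, 0.90],
})

# Drop the object columns ...
by_exclusion = toy_df.select_dtypes(exclude=['object'])
# ... or keep every integer and float column
by_inclusion = toy_df.select_dtypes(include='number')

print(by_exclusion.columns)
print(by_inclusion.columns)
```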


Of the remaining `object` columns, only `boro` is one that we would like to change to be numeric so that we can use it as a feature in predicting math SAT scores.

In [0]:
copied_sat_df.select_dtypes('object')[:2]


We'll learn how to finish cleaning this dataset in future lessons. 

### Summary

In this lab we practiced both exploring and coercing our data.  There is still a little more work to do before we can train our model.  We'll see what's left in the next lesson.