# Working with numeric data

#### EXERCISE:
If you expect the data type of a column to be numeric (<code>int</code> or <code>float</code>), but instead it is of type <code>object</code>,
this typically means that the column contains at least one non-numeric value, which is usually a sign of bad data.
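
As a small illustration (the sample values below are made up for this sketch and are not taken from the tips dataset), a single non-numeric entry is enough for pandas to store the whole column as <code>object</code>:

```python
import pandas as pd

# Hypothetical sample values; "missing" is the one bad entry
clean = pd.Series([16.99, 10.34, 21.01])
dirty = pd.Series([16.99, "missing", 21.01])

print(clean.dtype)  # float64
print(dirty.dtype)  # object
```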

You can use the <code>pd.to_numeric()</code> function to convert a column into a numeric data type. If the function
raises an error, you can be sure that the column contains a value that cannot be parsed as a number. You can either use the techniques
you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore the error
or <code>coerce</code> the bad value into a missing value, <code>NaN</code>.
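
A minimal sketch of the two behaviours, using a made-up Series rather than the tips data: with the default <code>errors='raise'</code>, <code>pd.to_numeric()</code> raises a <code>ValueError</code> on the unparseable string, while <code>errors='coerce'</code> replaces it with <code>NaN</code>.

```python
import pandas as pd

# Hypothetical values; "abc" cannot be parsed as a number
s = pd.Series(["1.5", "abc", "3"])

try:
    pd.to_numeric(s)  # default is errors='raise'
    raised = False
except ValueError as exc:
    raised = True
    print("Conversion failed:", exc)

# Coerce unparseable entries to NaN instead of raising
coerced = pd.to_numeric(s, errors="coerce")
print(coerced.dtype)         # float64
print(coerced.isna().sum())  # 1
```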

A modified version of the tips dataset has been pre-loaded into a DataFrame called <code>tips</code>. For instructional purposes, it has been pre-processed to introduce some 'bad' data for you to clean. Use the <code>.info()</code> method to explore this. You'll note that the <code>total_bill</code> and <code>tip</code> columns, which should be numeric, are instead of type <code>object</code>. Your job is to fix this.

#### INSTRUCTIONS:
* Use <code>pd.to_numeric()</code> to convert the <code>'total_bill'</code> column of <code>tips</code> to a numeric data type. Coerce the errors to <code>NaN</code> by specifying the keyword argument <code>errors='coerce'</code>.
* Convert the <code>'tip'</code> column of <code>tips</code> to a numeric data type exactly as you did for the <code>'total_bill'</code> column.
* Print the <code>info</code> of <code>tips</code> to confirm that the data types of <code>'total_bill'</code> and <code>'tip'</code> are numeric.

#### SCRIPT.PY:

In [20]:
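import pandas as pd

# Load the modified tips dataset
tips = pd.read_csv("tips_modified.csv")

# Convert the categorical columns to the category dtype
tips['sex'] = tips['sex'].astype('category')
tips['smoker'] = tips['smoker'].astype('category')
tips['day'] = tips['day'].astype('category')
tips['time'] = tips['time'].astype('category')

# Show the dtypes and non-null counts
tips.info()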

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null object
tip           244 non-null object
sex           234 non-null category
smoker        229 non-null category
day           243 non-null category
time          227 non-null category
size          231 non-null float64
dtypes: category(4), float64(1), object(2)
memory usage: 7.2+ KB


In [21]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips["total_bill"], errors="coerce")

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips["tip"], errors="coerce")

# Print the info of tips
tips.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    202 non-null float64
tip           220 non-null float64
sex           234 non-null category
smoker        229 non-null category
day           243 non-null category
time          227 non-null category
size          231 non-null float64
dtypes: category(4), float64(3)
memory usage: 7.2 KB
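
Coercion silently turns bad entries into <code>NaN</code> (in the output above, <code>total_bill</code> drops from 244 to 202 non-null values), so it can be worth checking which raw values were lost. One possible way to do this, sketched here on a small made-up DataFrame because the contents of <code>tips_modified.csv</code> are not shown, is to compare the raw column with its converted version:

```python
import pandas as pd

# Hypothetical stand-in for the raw data; values are invented for illustration
raw = pd.DataFrame({"total_bill": ["16.99", "missing", "21.01", "bad"]})

converted = pd.to_numeric(raw["total_bill"], errors="coerce")

# Entries that were present before conversion but are NaN afterwards
lost = raw.loc[raw["total_bill"].notna() & converted.isna(), "total_bill"]
print(lost.tolist())  # ['missing', 'bad']
```

Running the same comparison before overwriting the original column would show exactly which strings were coerced.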
