# Introduction

In this tutorial, you'll learn how to investigate data types within a DataFrame or Series.  You'll also learn how to find and replace entries.

# Dtypes

The data type for a column in a DataFrame or a Series is known as the **dtype**.

You can use the `dtype` property to get the type of a specific column. For instance, we can get the dtype of the `education` column in the `data` DataFrame:

In [3]:

import pandas as pd
data = pd.read_csv("Salary_data.csv", index_col=0)
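# Use the full option name; the short form 'max_rows' can match more than one option in newer pandas versions.
pd.set_option('display.max_rows', 5)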

In [5]:
data.education.dtype

dtype('O')

Alternatively, the `dtypes` property returns the `dtype` of _every_ column in the DataFrame:

In [6]:
data.dtypes

Salary       int64
Skill       object
            ...   
expense    float64
savings    float64
Length: 6, dtype: object

Data types tell us something about how pandas is storing the data internally. `float64` means that it's using a 64-bit floating point number; `int64` means a similarly sized integer instead, and so on.
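As a quick check of what "64-bit" means here, the following sketch builds a small DataFrame and inspects the storage size of each dtype. The values and the name `toy` are made up for illustration and are not taken from `Salary_data.csv`.

```python
import pandas as pd

# Hypothetical toy data with one integer and one floating point column.
toy = pd.DataFrame({"Salary": [39343, 46205], "expense": [31474.4, 36964.0]})

salary_dtype = toy.Salary.dtype
expense_dtype = toy.expense.dtype

# itemsize is the number of bytes used per value: 8 bytes = 64 bits.
print(salary_dtype, salary_dtype.itemsize * 8)
print(expense_dtype, expense_dtype.itemsize * 8)
```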

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the `object` type.

It's possible to convert a column of one type into another, wherever such a conversion makes sense, by using the `astype()` method. For example, we may transform the `Salary` column from its existing `int64` data type into a `float64` data type:

In [8]:
data.Salary.astype('float64')

YearsExperience
100.0      3900.0
110.0    390000.0
           ...   
10.3     122391.0
10.5     121872.0
Name: Salary, Length: 30, dtype: float64
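To illustrate the "wherever such a conversion makes sense" caveat, here is a minimal sketch on made-up Series (the values and variable names are assumptions, not part of the salary dataset). Converting integers to floats succeeds, while converting arbitrary text to integers raises an error.

```python
import pandas as pd

# Hypothetical toy data, not taken from Salary_data.csv.
salaries = pd.Series([39000, 46000, 57000], name="Salary")

# int64 -> float64 is a conversion that makes sense, so it succeeds.
as_float = salaries.astype("float64")

# Text such as "python" has no integer representation, so astype() raises.
try:
    pd.Series(["python", "c++"]).astype("int64")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(as_float.dtype)
print(conversion_failed)
```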

A DataFrame or Series index has its own `dtype`, too:

In [9]:
data.index.dtype

dtype('float64')

Pandas also supports more exotic data types, such as categorical data and timeseries data. Because these data types are more rarely used, we will omit them until a much later section of this tutorial.

# Missing data

Entries with missing values are given the value `NaN`, short for "Not a Number". For technical reasons these `NaN` values are always of the `float64` dtype.
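One visible consequence is that an otherwise integer column becomes `float64` as soon as it contains a `NaN`. A small sketch with made-up values (the Series names are illustrative only):

```python
import numpy as np
import pandas as pd

# A column of whole numbers is stored as int64.
ages = pd.Series([26, 22, 23])

# Introducing a single NaN forces the whole column to float64,
# because NaN itself is a floating point value.
ages_with_gap = pd.Series([26, np.nan, 23])

print(ages.dtype)
print(ages_with_gap.dtype)
print(type(np.nan))
```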

Pandas provides some methods specific to missing data. To select `NaN` entries you can use `pd.isnull()` (or its companion `pd.notnull()`). It is meant to be used like this:

In [13]:
data[pd.isnull(data.Skill)]

                 Salary Skill  Age education  expense  savings
YearsExperience                                               
2.0               43525   NaN   26  bachelor  34820.0   8705.0
3.7               57189   NaN   22  bachelor  45751.2  11437.8
4.1               57081   NaN   23   masters  45664.8  11416.2
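The companion `pd.notnull()` works the same way but keeps the rows that do have a value. The sketch below uses a small stand-in DataFrame (the name `toy` and its contents are assumptions for illustration) and also counts the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data, with one missing Skill.
toy = pd.DataFrame({
    "Salary": [43525, 57189, 39343],
    "Skill": [np.nan, "python", "c++"],
})

# Keep only the rows where Skill is present.
has_skill = toy[pd.notnull(toy.Skill)]

# The boolean mask sums to the number of missing entries.
missing_count = pd.isnull(toy.Skill).sum()

print(has_skill)
print(missing_count)
```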


Replacing missing values is a common operation. Pandas provides a handy method for this: `fillna()`, which offers a few different strategies for filling in such data. For example, we can simply replace each `NaN` with the string `"Unknown"`:

In [12]:
data.Skill.fillna("Unknown")

YearsExperience
100.0       c++
110.0    python
          ...  
10.3     python
10.5        c++
Name: Skill, Length: 30, dtype: object

Or we could fill each missing value with the first non-null value that appears after the given record in the dataset. This is known as the backfill strategy.
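A minimal sketch of backfilling, using the `bfill()` method on a made-up Series (the values are illustrative, not from the dataset). Note that a missing value at the very end has nothing after it to copy from, so it stays `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical Skill column with gaps.
skills = pd.Series(["c++", np.nan, np.nan, "python", np.nan])

# Each NaN takes the next non-null value that follows it.
backfilled = skills.bfill()

print(backfilled)
```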

The `replace()` method is worth mentioning here because it's handy for replacing missing data that is encoded with some kind of sentinel value in the dataset: things like `"Unknown"`, `"Undisclosed"`, `"Invalid"`, and so on.
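For instance, sentinel strings can be turned into real `NaN` values so that the missing-data methods above apply to them. This is a sketch on a made-up Series; the choice of sentinels is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Skill column where missing data was recorded as text.
skills = pd.Series(["python", "Unknown", "c++", "Undisclosed"])

# Map every sentinel string to NaN in one call.
cleaned = skills.replace(["Unknown", "Undisclosed"], np.nan)

print(cleaned)
print(cleaned.isnull().sum())
```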
