# Exploring Data

### Introduction

Now that we know how to work with a dataframe and select individual columns, it's time for us to see if we can begin to understand our data.

### Exploring a DataFrame

Let's get a better sense of the data in our FEMA dataset. The first thing we'll do is load the data and look at the first few rows to see which columns are available.

In [1]:
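import pandas as pd
# Path to the CSV file of Houston flood claims
url = 'houston_claims.csv'
# Use the first column of the file as the row index
claims_df = pd.read_csv(url, index_col=0)
# Show the first three rows
claims_df[:3]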

| | reportedCity | dateOfLoss | elevatedBuildingIndicator | floodZone | latitude | longitude | lowestFloodElevation | amountPaidOnBuildingClaim | amountPaidOnContentsClaim | yearofLoss | reportedZipcode | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HOUSTON | 2017-08-27T00:00:00.000Z | False | X | 29.7 | -95.5 | NaN | 195857.43 | 0.0 | 2017-01-01T00:00:00.000Z | 77096 | 5e398d6774cbd479fc898dea |
| 1 | HOUSTON | 2008-09-12T00:00:00.000Z | False | X | 29.5 | -95.1 | NaN | 0.0 | 0.0 | 2008-01-01T00:00:00.000Z | 77058 | 5e398d6774cbd479fc898dfc |
| 2 | HOUSTON | 2004-06-29T00:00:00.000Z | False | X | 29.8 | -95.6 | NaN | 1420.89 | 0.0 | 2004-01-01T00:00:00.000Z | 77042 | 5e398d6774cbd479fc898e4b |
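
Beyond looking at the first rows, a few attributes give a quick picture of a dataframe's structure. The sketch below is only an illustration: `sample_df` is a hypothetical stand-in built from the three rows shown above (limited to four columns) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for claims_df: three rows copied from the output above,
# restricted to a few columns so the example runs without the CSV file.
sample_df = pd.DataFrame({
    "reportedCity": ["HOUSTON", "HOUSTON", "HOUSTON"],
    "floodZone": ["X", "X", "X"],
    "amountPaidOnBuildingClaim": [195857.43, 0.0, 1420.89],
    "reportedZipcode": [77096, 77058, 77042],
})

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # column names
print(sample_df.dtypes)         # data type of each column
```

The same attributes should work on `claims_df` once it is loaded.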


### Viewing Sample Statistics

In pandas, we can view sample statistics either on a column or across a dataframe.  For example, we can look at the average `amountPaidOnBuildingClaim` in the dataset.

In [3]:
claims_df.amountPaidOnBuildingClaim.mean()

66239.67425949442

> So we can see that the average amount paid on a building claim is roughly `66239.67`.
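
One detail worth knowing is that `mean` skips missing values by default. The sketch below uses a hypothetical series (three payments from the rows shown earlier plus one missing entry) to illustrate this; it is not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values: three payments from the rows shown earlier plus one missing entry.
paid = pd.Series([195857.43, 0.0, 1420.89, np.nan], name="amountPaidOnBuildingClaim")

mean_paid = paid.mean()        # NaN is skipped: sum of the 3 values divided by 3
median_paid = paid.median()    # middle value of the 3 non-missing entries
n_missing = paid.isna().sum()  # number of missing entries

print(mean_paid, median_paid, n_missing)
```

Comparing the mean with the median is a quick way to see whether a few large payments are pulling the average up.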

Or we can look at the mean across our numeric columns.

In [4]:
claims_df.mean(numeric_only=True)

elevatedBuildingIndicator        0.046052
latitude                        29.779968
longitude                      -95.448202
lowestFloodElevation            61.329824
amountPaidOnBuildingClaim    66239.674259
amountPaidOnContentsClaim    20042.823729
reportedZipcode              77040.538700
dtype: float64

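> Notice that the result only covers columns with numeric or boolean values, such as `elevatedBuildingIndicator`. Recent versions of pandas need `numeric_only=True` for this, while older versions skipped the non-numeric columns automatically.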
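
To see how a dataframe-wide mean treats different column types, here is a minimal sketch on a hypothetical three-row frame (`mixed_df` and its values are made up for illustration). It passes `numeric_only=True` explicitly, which limits the calculation to numeric and boolean columns.

```python
import pandas as pd

# Hypothetical mini-frame mixing text, boolean and numeric columns.
mixed_df = pd.DataFrame({
    "floodZone": ["X", "AE", "X"],
    "elevatedBuildingIndicator": [False, True, False],
    "amountPaidOnContentsClaim": [0.0, 100.0, 200.0],
})

# numeric_only=True restricts the calculation to numeric and boolean columns,
# so the text column floodZone is left out of the result.
means = mixed_df.mean(numeric_only=True)
print(means)
```

The boolean column is averaged as zeros and ones, so its mean is the share of rows that are `True`.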

Now if we want to get an overview of the data in each of the numeric columns, we can do so with the `describe` method.

In [7]:
claims_df.describe()

| | latitude | longitude | lowestFloodElevation | amountPaidOnBuildingClaim | amountPaidOnContentsClaim | reportedZipcode |
|---|---|---|---|---|---|---|
| count | 19943.0 | 19943.0 | 4939.0 | 19669.0 | 16442.0 | 20000.0 |
| mean | 29.779968 | -95.448202 | 61.329824 | 66239.67 | 20042.823729 | 77040.5387 |
| std | 0.312829 | 0.487151 | 26.201233 | 90228.18 | 37712.538422 | 1090.962214 |
| min | 29.5 | -149.8 | -1.0 | 0.0 | 0.0 | 15342.0 |
| 25% | 29.7 | -95.5 | 49.0 | 2250.0 | 0.0 | 77033.0 |
| 50% | 29.8 | -95.5 | 59.0 | 32820.4 | 2311.275 | 77062.0 |
| 75% | 29.8 | -95.4 | 78.0 | 92348.82 | 25600.0 | 77084.0 |
| max | 61.6 | -80.2 | 252.0 | 1436755.0 | 500000.0 | 99694.0 |


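> So we can see the mean and standard deviation, as well as the median (the 50% row).

As we can see, this shows us the `mean` (that is, the average) and the standard deviation (which we'll describe later), as well as the minimum, the maximum, and the 25th, 50th, and 75th percentiles. The `count` row gives the number of non-missing values in each column, so we can also see that `lowestFloodElevation` is filled in for only 4939 rows.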
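
To make the rows of the `describe` output concrete, here is a small sketch on a hypothetical series of five entries, one of them missing. The values are made up so the statistics are easy to check by hand.

```python
import pandas as pd

# Hypothetical payments, including one missing value.
payments = pd.Series([0.0, 10.0, 20.0, 30.0, None], name="payment")

summary = payments.describe()
print(summary)

# Individual statistics can be read from the result by their row label.
non_missing = summary["count"]  # the missing entry is not counted
middle = summary["50%"]         # the 50% row is the median
```

Here `summary["50%"]` should match `payments.median()`, and `summary["count"]` reflects only the four non-missing entries.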

### Working with Categorical Values

The functions above are good for summarizing numeric data. Now let's talk about categorical data. For example, although `reportedZipcode` is stored as a number, it is really a label, so we may want to see which zip codes occur most often. We can do so with the `value_counts` method.

In [8]:
claims_df['reportedZipcode'].value_counts()[:10]

77096    1881
77079    1039
77025     855
77024     849
77084     693
77089     682
77074     672
77035     536
77088     461
77070     455
Name: reportedZipcode, dtype: int64

> So we can see that the most frequent zipcode is 77096, occurring 1881 times.
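
Raw counts can also be expressed as shares of the total. The sketch below uses a hypothetical five-entry series of zip codes (not the real column) and the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Hypothetical zip codes; the real column is much longer.
zips = pd.Series([77096, 77096, 77079, 77096, 77025], name="reportedZipcode")

counts = zips.value_counts()                # sorted from most to least frequent
shares = zips.value_counts(normalize=True)  # fractions that sum to 1

print(counts.head(2))
print(shares.head(2))
```

Calling `.head(10)` on the result is an alternative to slicing with `[:10]` as in the cell above.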

Let's try this again with `floodZone`.

> We first select the `floodZone` series, and then we call the `value_counts` method on that series.

In [14]:
claims_df['floodZone'].value_counts()[:10]

X      10276
AE      8708
C        363
A        273
AOB       94
AO        75
B         45
A04       40
A05       30
A10       20
Name: floodZone, dtype: int64

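> Note that here we call `value_counts` on a single series. Since pandas 1.1, a dataframe also has a `value_counts` method, but it counts unique combinations of values across columns rather than the values of one column.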
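
By default, `value_counts` leaves out missing values. If a column may contain gaps, passing `dropna=False` includes them in the tally. The series below is hypothetical and only meant to show the difference.

```python
import pandas as pd

# Hypothetical flood zones with two missing entries.
zones = pd.Series(["X", "AE", None, "X", None, "X"], name="floodZone")

without_missing = zones.value_counts()          # missing entries are ignored
with_missing = zones.value_counts(dropna=False) # missing entries get their own row

print(without_missing)
print(with_missing)
```

Comparing the two totals is a quick check for how many entries in a categorical column are missing.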

### Summary

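In this lesson, we explored our FEMA claims data. We used `mean` to calculate the average of a single column and of all the numeric columns, `describe` to get an overview of each numeric column, and `value_counts` to find the most frequent values in a categorical column.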