# Level 2: Exploring Data

Before continuing on with level 2, make sure you have completed level 1; specifically, we'll be needing the `weather_data.csv` file generated in level 1. This contains weather data from 2013 to 2015, and we'll be exploring that data within this level.

We'll now continue our data project by exploring the treasure trove of data we collected in level 1. Exploration is an often overlooked but important part of data science; it helps us catch errors that may not have come up in the process of obtaining the data. Additionally, it gives us some intuition for the data, which is helpful when it comes to modeling.

Let's get started by reading in the data that we scraped in the last level.

In [None]:
import pandas as pd

data = pd.read_csv('weather_data.csv')

Pandas is an extremely useful and critical tool for data manipulation and processing within Python; here, it simplified our life by allowing us to read in a CSV (Comma Separated Values) file with just one function call. Let's see what it read in by examining the first few rows.

In [None]:
data[:5]

Awesome, that's exactly what we were expecting! The above table is the first few rows of a *data frame*, which is simply a fancy name for a spreadsheet-like table; it just means that we know the different columns and that each column has only one type of data.

To explain the syntax of the last statement, let's consider the case of lists in Python. The same code would give us the first 5 elements if `data` were a list; the same principle applies with a data frame, but instead, we get the first 5 rows of `data`. A sometimes useful mental model is that a data frame is simply a list of lists where each element of the outer list is a list of values associated with one observation of data.
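
To make the list analogy concrete, here is a small sketch that slices a plain list and a data frame side by side. The `toy` frame and its values are made up for illustration and are not part of the weather data.

```python
import pandas as pd

# Hypothetical toy values, not taken from weather_data.csv
values = [10, 20, 30, 40, 50, 60, 70]
toy = pd.DataFrame({'day': [1, 2, 3, 4, 5, 6, 7], 'temperature': values})

first_five_values = values[:5]   # first 5 elements of a list
first_five_rows = toy[:5]        # first 5 rows of a data frame

print(first_five_values)
print(first_five_rows)
print(toy.head())                # head() returns the first 5 rows by default
```

Both slices stop before position 5, so each keeps exactly five items; `head()` is a more explicit way of asking a data frame for the same rows.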

Continuing on the list analogy, let's try finding the `len` of a data frame:

In [None]:
len(data)

It turns out that the `len` of a data frame is simply the number of rows that it has; this is useful to check since we expect one row per day, or 1095 rows for three years of data (3 * 365 = 1095). Here's another useful attribute:

In [None]:
data.shape

This attribute tells us both the number of rows and columns at once, sweet! 1095, as we saw before, is the number of rows, and 14 is the number of columns.
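
The relationship between `len` and `shape` can be checked on a tiny frame where the counts are easy to verify by eye. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: 4 rows (observations) and 3 columns
toy = pd.DataFrame({
    'year': [2013, 2013, 2014, 2015],
    'month': [1, 2, 1, 12],
    'precipitation': [0.0, 0.1, 0.25, 0.0],
})

n_rows, n_columns = toy.shape    # shape is a (rows, columns) tuple
print(len(toy), n_rows, n_columns)
```

Since `shape` is a plain tuple, it can be unpacked into two variables in one step.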

Before going on, let's rename the columns to be lowercase and not contain spaces; this helps with other things down the line that will be pointed out and is also standard data science practice.

In [None]:
column_names = data.columns
new_column_names = []
for column_name in column_names:
    new_column_name = column_name.lower()
    new_column_name = new_column_name.replace(' ', '_')
    new_column_names.append(new_column_name)
data.columns = new_column_names
data[:5]
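
The loop above spells out each step; as an alternative sketch, pandas can apply the same string operations to all column names at once through the `.str` accessor. The column names below are hypothetical and only illustrate the renaming.

```python
import pandas as pd

# Hypothetical column names, chosen only to illustrate the renaming
toy = pd.DataFrame(columns=['Mean Temperature', 'Max Wind Speed', 'Month'])

# Same transformation as the loop, applied to every name at once
toy.columns = toy.columns.str.lower().str.replace(' ', '_', regex=False)
print(list(toy.columns))
```

Either version works; the loop is easier to step through, while the `.str` version is shorter.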

That did the trick! We can now start accessing particular rows and columns depending on what we want to accomplish.

In [None]:
data.iloc[0]

This gives us all of the values of the first row of data. Let's try the first column instead:

In [None]:
data.month
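
One reason the lowercase, space-free column names matter is that attribute-style access such as `data.month` only works when the column name is a valid Python identifier. The following sketch uses a hypothetical `toy` frame that deliberately keeps one name with a space in it.

```python
import pandas as pd

# Hypothetical toy frame; 'Mean Temperature' deliberately keeps its space
toy = pd.DataFrame({'month': [1, 2, 3], 'Mean Temperature': [30.5, 35.0, 42.1]})

first_row = toy.iloc[0]               # one row, selected by position
month_attr = toy.month                # attribute-style column access
month_item = toy['month']             # bracket-style column access
mean_temp = toy['Mean Temperature']   # names with spaces need brackets

print(first_row)
print(month_attr.equals(month_item))
```

Bracket access always works, so it is the safer choice when a column name might collide with an existing data frame method or contain unusual characters.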

Great, we can now access whole regions of data using the appropriate syntax depending on whether we want rows or columns. Let's now try programmatically looking for subsets of the data. For example, say we only wanted the data that was recorded in the month of December.

In [None]:
december_data = data[data.month == 12]
december_data[:5]

We can chain these conditions to ensure that multiple conditions are met. Let's try extracting data from May 2015.

In [None]:
may_2015_data = data[(data.month == 5) & (data.year == 2015)]
may_2015_data[:5]
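
It can help to look at what the conditions produce on their own: each comparison yields a column of true/false values, one per row, and indexing with that column keeps only the rows marked true. The `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with a few month/year combinations
toy = pd.DataFrame({
    'year': [2014, 2015, 2015, 2015],
    'month': [5, 5, 12, 5],
    'mean_temperature': [60, 62, 35, 65],
})

is_may = toy.month == 5        # boolean Series, one entry per row
is_2015 = toy.year == 2015
mask = is_may & is_2015        # element-wise "and"; | would be "or"

print(mask.tolist())
print(toy[mask])               # keeps only the rows where mask is True
```

When the conditions are written inline, each one needs its own parentheses because `&` binds more tightly than `==` in Python.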

Now that we know how to explore the data, let's look at some techniques for summarizing the data in different columns. First up is the `dtypes` attribute.

In [None]:
data.dtypes

This attribute seems simple because it just lists the type of each column; however, it's useful if you spot something you don't expect. For example, we expect that `precipitation`, `wind_speed`, `max_wind_speed`, and `max_gust_speed` are all numeric types, but they are currently `object` types. Let's see if we can figure out what's going on. The `unique` function will show us all of the unique values of a particular column.

In [None]:
data.precipitation.unique()

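It looks like we found the problem! While most of the values are things we'd expect for a numeric column, we also have an odd one out: `'T'`. We can convert these columns to numeric types using the handy `pd.to_numeric` function; passing `errors = 'coerce'` turns any value that can't be read as a number into `nan`.

In [None]:
# Convert the columns that should be numeric; unparsable values become nan
numeric_columns = ['precipitation', 'wind_speed', 'max_wind_speed', 'max_gust_speed']
clean_data = data.copy()
clean_data[numeric_columns] = clean_data[numeric_columns].apply(pd.to_numeric, errors = 'coerce')
print(clean_data.dtypes)
clean_data.precipitation.unique()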
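
The effect of forcing a text column into numbers is easier to see on a handful of values. This sketch uses `pd.to_numeric` with `errors='coerce'`; the values in `raw` are made up to mimic a column that mixes numbers with a `'T'` marker.

```python
import pandas as pd

# Hypothetical values mimicking a column that mixes numbers with a 'T' marker
raw = pd.Series(['0.00', '0.12', 'T', '1.30'])
print(raw.dtype)                 # a text type, not a numeric one

# errors='coerce' replaces anything that is not a number with NaN
converted = pd.to_numeric(raw, errors='coerce')
print(converted.dtype)
print(converted.tolist())
```

Only the `'T'` entry becomes `nan`; the other strings are parsed into ordinary floating point numbers.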

Great, it looks like the conversion did what we expected for the data types, but it introduced this weird value of `nan`, which marks a missing value. We can drop the rows containing missing values from the data frame using the `dropna` function.

In [None]:
clean_data = clean_data.dropna().reset_index(drop = True)

Once we have dropped the rows with missing values, `pandas` still keeps the old row numbering, but we fixed that with the call to `reset_index`, which just renumbers the rows starting from 0 (`drop = True` discards the old numbering instead of keeping it as a column). We can now check whether any missing values remain using a call to `isnull` (which marks each `nan` as true) and `any`, which tells us whether at least one value in a result is true. Let's try an example:

In [None]:
print(clean_data.precipitation[:5].isnull())
print(clean_data.precipitation[:5].isnull().any())
print(clean_data.precipitation.isnull().any())
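
The whole clean-up sequence can be traced on a frame small enough to check by hand. The `toy` frame below is hypothetical and contains a single missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing precipitation value
toy = pd.DataFrame({
    'mean_temperature': [40.0, 45.0, 50.0],
    'precipitation': [0.1, np.nan, 0.0],
})

print(toy.precipitation.isnull().tolist())   # marks which entries are missing
print(toy.precipitation.isnull().any())      # True if at least one is missing
print(toy.precipitation.isnull().sum())      # how many are missing

dropped = toy.dropna()                       # the middle row is removed
renumbered = dropped.reset_index(drop=True)  # row labels start from 0 again
print(list(dropped.index), list(renumbered.index))
```

Summing the result of `isnull` is a handy companion to `any`, since it reports how many values are missing rather than only whether some are.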

Awesome, now that we have cleaned up our data, let's try using the `describe` function to get a better idea of what's going on in the data.

In [None]:
clean_data.describe().transpose()

(The `transpose` function just turns any data frame sideways; here, it was done for readability.)
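
To see what `transpose` changes, this sketch compares the row and column labels of a summary before and after turning it sideways. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns
toy = pd.DataFrame({
    'mean_temperature': [30.0, 40.0, 50.0, 60.0],
    'precipitation': [0.0, 0.1, 0.0, 0.3],
})

summary = toy.describe()         # one column per variable, one row per statistic
sideways = summary.transpose()   # one row per variable, one column per statistic

print(list(summary.index))
print(sideways)
```

With many columns, the transposed layout tends to fit on screen better because the list of statistics is short while the list of variables is long.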

The reason that `describe` is so cool is that we get summary statistics for every single numeric column. Everything seems okay, so let's get started on plotting the data to see even more patterns.

We'll be using the matplotlib and Seaborn packages within Python to plot data; Seaborn is built on top of matplotlib but is specifically built for statistical visualizations, which is why we'll prefer to use it. However, some `matplotlib` functions will still be helpful. To start, we'll need the following few lines:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

These load the packages we need and also tell the notebook how to include our graphics. Great, let's do our first plot now:

In [None]:
sns.histplot(clean_data.mean_temperature, kde = True)

This simple plot just visualizes the distribution of the average temperature across all the days we collected data for; specifically, it plots the histogram (the bars) and an estimate of the distribution (the line). We can also just plot the histogram.

In [None]:
sns.histplot(clean_data.mean_temperature, kde = False)
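
A histogram is just a count of how many values fall into each bin, and the heights of the bars are those counts. As a rough sketch of the idea, NumPy's `np.histogram` computes them without drawing anything; the temperatures below are made up.

```python
import numpy as np

# Hypothetical temperatures, not taken from the weather data
temperatures = np.array([31, 33, 38, 41, 42, 44, 47, 52, 55, 58])

# Three equal-width bins covering 30 to 60
counts, bin_edges = np.histogram(temperatures, bins=3, range=(30, 60))

print(counts)      # height of each bar
print(bin_edges)   # where each bar starts and ends
```

Changing the number of bins changes how finely the values are grouped, which is why the same data can look smooth or spiky depending on the bin count.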

Neat! Let's add a title and some axis labels.

In [None]:
sns.histplot(clean_data.mean_temperature, kde = False)
plt.title('Daily Average Temperature (2013 - 2015)')
plt.xlabel('Temperature')
plt.ylabel('Frequency')

That looks like a pretty fancy graph. Let's zoom in on a portion by setting the limits of the plot; we'll also change the bin size accordingly since we're looking at a portion of the plot.

In [None]:
sns.histplot(clean_data.mean_temperature, kde = False, bins = 40)
plt.title('Zoomed In - Daily Average Temperature (2013 - 2015)')
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.xlim((30, 60))
plt.ylim((0, 50))
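
One detail worth keeping in mind when zooming: the axis limits only change which part of the plot is visible, while the bins are still computed over the full range of the data. The sketch below shows this with plain matplotlib on made-up temperatures; the `Agg` backend is an assumption that lets it run outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen so this runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical temperatures, not taken from the weather data
temperatures = [31, 33, 38, 41, 42, 44, 47, 52, 55, 58, 70, 85]

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(temperatures, bins=40)
ax.set_xlim(30, 60)   # same effect as plt.xlim((30, 60)) on the current axes
ax.set_ylim(0, 50)

print(ax.get_xlim(), ax.get_ylim())
print(edges[0], edges[-1])   # the bins still span the full data range
```

This is why a zoomed-in plot usually needs more bins than the full view: only some of them land inside the visible window.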

These same functions that we've been using to edit the graph can be used more generally, but let's move on to more interesting graphs. Namely, let's try plotting the histograms of the average and maximum temperature on the same graph.

In [None]:
sns.histplot(clean_data.mean_temperature, kde = False)
sns.histplot(clean_data.max_temperature, kde = False)
plt.title('Daily Average and Max Temperature (2013 - 2015)')
plt.xlabel('Temperature')
plt.ylabel('Frequency')

Whoa, cool plot alert! Let's add a legend to make sure someone looking at the plot knows which histogram is which.

In [None]:
sns.histplot(clean_data.mean_temperature, kde = False, label = "Average Temperature")
sns.histplot(clean_data.max_temperature, kde = False, label = "Max Temperature")
plt.title('Daily Average and Max Temperature (2013 - 2015)')
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.legend()

We're getting pretty good at this. Let's try plotting a scatterplot to see the relationship between temperature and precipitation.

In [None]:
plt.scatter(clean_data.mean_temperature, clean_data.precipitation)
plt.title('Temperature vs Precipitation')
plt.xlabel('Temperature')
plt.ylabel('Precipitation')

This plot can help us think about the next step of modeling the data; it doesn't seem like temperature by itself will do a great job of predicting the amount of precipitation since there's a range of possible precipitation values for each temperature.

It'd be a hassle to do a scatterplot for every possible variable, but luckily, we can use the built-in `pairplot` function. (We're only taking a few columns of the `clean_data` data frame though to keep things manageable.)

In [None]:
sns.pairplot(data = clean_data[['mean_temperature', 'precipitation', 'dew_point', 'wind_speed']])
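
A numeric companion to a grid of scatterplots is a table of pairwise correlations, which pandas computes with `corr`. This is only a sketch on made-up values, so the numbers say nothing about the real weather data.

```python
import pandas as pd

# Hypothetical toy values, chosen only to show the shape of the output
toy = pd.DataFrame({
    'mean_temperature': [30.0, 40.0, 50.0, 60.0, 70.0],
    'dew_point': [20.0, 31.0, 39.0, 52.0, 60.0],
    'precipitation': [0.3, 0.0, 0.5, 0.1, 0.2],
})

correlations = toy.corr()   # pairwise Pearson correlation between columns
print(correlations.round(2))
```

Values near 1 or -1 point to a strong linear relationship, while values near 0 suggest that one variable alone says little about the other; the plots remain useful for spotting patterns that a single number hides.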

In this level, we looked at how to explore our data to make sure nothing's wrong with it and to start thinking about how to model precipitation. Once you're ready, we'll see you on the next level to start modeling the data.