# Baseball First Look

## Learning Objectives

Take an initial look at a data set:
    * open CSV files
    * conduct indexing and boolean slicing
    * show summary data views
    * create basic plots

## Imports

Import the pandas package. Pandas is a workhorse of the Python analytics community.

In [None]:
import pandas as pd

Initialize inline plotting with the IPython magic command shown below.

In [None]:
%matplotlib inline

## Get Data

In [None]:
df = pd.read_csv('../data/baseball_data.csv')

## Look at Top 5 Rows

Notes:
    * the target "salary_in_thousands_of_dollars" is a positive continuous variable.
    * the last four columns are 1, 0 indicators
    * the only defensive variable is "number_of_errors"

In [None]:
df.head()

## View Data Types

In [None]:
df.dtypes

## Slicing Basics

There are three major slicing methods: [ ... ], .loc[ ... ], and .iloc[ ... ].

#### The [ ... ] operator

Below we demonstrate the implicit nature of the [ ... ] operator and its sometimes unintuitive behavior. It is best to reserve the [ ... ] operator for cases (1) and (3). Case (2), selecting by position, is best done with .iloc[ ... ].

(1) Using the [ ... ] operators will allow pandas to infer what is being requested. Passing a column name or list of column names to this operator will return a Series or DataFrame containing the column or columns requested.

In [None]:
# returns a Series
df['batting_average']

In [None]:
# returns a DataFrame
df[['batting_average', 'on_base_percentage']]

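(2) Handing in a list of ints is treated as a list of column labels in current pandas, so it raises a KeyError unless the columns are actually named with those ints. To select columns by position, use .iloc[ ... ]. Notice the use of 0-based indexing.

In [None]:
# returns the first four columns (positions 0 through 3)
df.iloc[:, [0, 1, 2, 3]]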

(3) Below we create a Series of True and False values, one per row. Then we pass the Series to df[ ... ] and slice rows instead of columns.

In [None]:
# returns the first 5 rows (assuming the default 0, 1, 2, ... index)
boolean_series = pd.Series(df.index < 5, index=df.index)
df[boolean_series]
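In practice the boolean Series usually comes from comparing a column to a value. The sketch below illustrates this on a small made-up frame; the name `toy` and its values are hypothetical, and the column names only mimic the baseball file.

```python
import pandas as pd

# Hypothetical toy data, not taken from baseball_data.csv.
toy = pd.DataFrame({
    'batting_average': [0.250, 0.310, 0.198, 0.275],
    'number_of_errors': [3, 0, 7, 1],
})

# Comparing a column to a value yields a boolean Series aligned to the index.
mask = toy['batting_average'] > 0.260

# Passing the mask to [ ... ] keeps only the rows where the mask is True.
high_average = toy[mask]
print(high_average)
```

Because the mask shares the frame's index, pandas aligns the two by label, so the row order of the mask does not need to be managed by hand.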

#### The .loc[ ... ] operator

The .loc[ ... ] operator provides simultaneous row and column access using the syntax df.loc[some_row_indexer, some_column_indexer]. The .loc[ ... ] operator also provides access to the members of a DataFrame for the purpose of overwriting data.

In [None]:
# first 3 rows and all columns; notice the :2 and : syntax used to slice.
# also note the 0-based index labels and the inclusive range (label 2 is included)
df.loc[:2, :]
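The overwriting use of .loc[ ... ] can be sketched as follows. This is a minimal illustration on made-up data; `toy`, `bad_rows`, and the rule that a negative error count should be reset to 0 are assumptions for the example, not part of the baseball data set.

```python
import pandas as pd

# Hypothetical toy data with one clearly invalid value (-1 errors).
toy = pd.DataFrame({
    'batting_average': [0.250, 0.310, 0.198],
    'number_of_errors': [3, -1, 7],
})

# Row indexer: a boolean Series. Column indexer: a column name.
bad_rows = toy['number_of_errors'] < 0

# Assigning through .loc writes into the original frame.
toy.loc[bad_rows, 'number_of_errors'] = 0

# Label slices are inclusive: labels 0 and 1 are both returned.
first_two = toy.loc[:1, :]
print(toy)
```

Writing through a single .loc[rows, columns] call, rather than chaining two [ ... ] lookups, makes it unambiguous that the original DataFrame is the one being modified.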

#### The .iloc[ ... ] operator

The .iloc[ ... ] operator works like .loc[ ... ], but whereas .loc[ ... ] references rows and columns by label, .iloc[ ... ] references them by integer position.

In [None]:
# sort df on the batting_average column
sortvals = df.sort_values('batting_average')

In [None]:
# notice the index labels stay attached to their rows after sorting
sortvals.head()

In [None]:
# using .loc returns every row from the top through the row labeled 147 (inclusive)
sortvals.loc[:147, :]

In [None]:
# returns the first 10 rows with no consideration of the index names
sortvals.iloc[:10, :]
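The difference between label-based and position-based slicing is easiest to see on a frame small enough to read in full. The sketch below uses a hypothetical four-row frame (`toy`, with made-up batting averages) rather than the baseball data.

```python
import pandas as pd

# Hypothetical toy data; index labels start out as 0, 1, 2, 3.
toy = pd.DataFrame({'batting_average': [0.300, 0.200, 0.250, 0.100]})

# After sorting, the labels travel with their rows: 3, 1, 2, 0.
sorted_toy = toy.sort_values('batting_average')

# .loc slices by label: everything from the top through label 1 (inclusive).
by_label = sorted_toy.loc[:1, :]

# .iloc slices by position: only the first row, end position excluded.
by_position = sorted_toy.iloc[:1, :]

print(list(by_label.index))     # labels 3 and 1
print(list(by_position.index))  # label 3 only
```

The same slice text, `:1`, therefore selects two rows with .loc and one row with .iloc once the index is no longer in its original order.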

## Use Boolean Indexing to Count the Number of NaNs

Chaining the .isnull() and .any(axis=1) methods on a DataFrame returns a boolean Series indicating which rows contain at least one null value.

In [None]:
df.isnull().any(axis=1)

Get a count of the rows containing nulls, using the fact that summing booleans treats True as 1 and False as 0.

In [None]:
df.isnull().any(axis=1).sum()

Store the boolean Series in a variable.

In [None]:
bix = df.isnull().any(axis=1)

Subset the dataframe by handing in the boolean series to view the rows with NaNs. Any row where the series is True is kept.

In [None]:
df[bix]

## Use Boolean Indexing to Remove NaN Rows

Now we use the .notnull() method, chained with .all(axis=1), to find the rows with no null values and store the result as a boolean Series.

In [None]:
bix = df.notnull().all(axis=1)

This returns the DataFrame without the rows that contain nulls.

In [None]:
df[bix]

Store the new DataFrame under a name.

In [None]:
subset = df[bix]
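The two boolean-indexing steps above can be checked on a small frame. The sketch below uses hypothetical data (`toy`) and also compares the result with pandas' built-in `DataFrame.dropna()`, which with its default arguments drops every row containing at least one null; that comparison is an addition for illustration, not something the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a NaN in two different rows.
toy = pd.DataFrame({
    'batting_average': [0.250, np.nan, 0.198],
    'number_of_runs': [40, 55, np.nan],
})

# True for each row that has at least one NaN.
rows_with_nan = toy.isnull().any(axis=1)
n_rows_with_nan = rows_with_nan.sum()

# Keep only the rows in which every value is present.
complete = toy[toy.notnull().all(axis=1)]

# dropna() with default arguments should give the same frame.
same_as_dropna = complete.equals(toy.dropna())
print(n_rows_with_nan, same_as_dropna)
```

Building the mask explicitly is more verbose than `dropna()`, but it leaves the mask available for inspecting the dropped rows before they are discarded.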

## Look at Summary Stats

Note:
    * these variables are on very different scales
    * no variable takes negative values
    * consider the distribution of each individual variable
        - Salary, for instance, is heavily skewed
        - This can be seen by comparing the mean and the median (the 50% row)

In [None]:
subset.describe()
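The mean-versus-median comparison can be sketched on a handful of made-up salaries. The values below are hypothetical and chosen only so that one large salary pulls the mean well above the median; `Series.skew()` is included as an additional numeric check.

```python
import pandas as pd

# Hypothetical salaries (in thousands); one large value creates right skew.
salary = pd.Series([100, 120, 150, 200, 3000],
                   name='salary_in_thousands_of_dollars')

mean_salary = salary.mean()
median_salary = salary.median()

# describe() reports the median as the '50%' entry.
median_from_describe = salary.describe()['50%']

# A positive sample skewness indicates a long right tail.
skewness = salary.skew()

print(mean_salary, median_salary, skewness)
```

A mean far above the median, as here, is the pattern to look for in the salary column of the describe() output.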

## Review Pairwise Correlations

Look for variables that correlate with the target. For each variable that is highly correlated with the target, check which other variables it is highly correlated with. Be aware of these relationships: strongly correlated predictors (multicollinearity) can cause problems with the regression model.

In [None]:
subset.corr()
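With many columns the full correlation matrix is hard to scan, so one option is to pull out the target's column and rank it. The sketch below does this on hypothetical data; `toy` and its values are made up, and ranking by absolute correlation is an assumption about what is useful here rather than a step from the original analysis.

```python
import pandas as pd

# Hypothetical toy data; the two run-related columns are built to move together.
toy = pd.DataFrame({
    'salary_in_thousands_of_dollars': [100, 200, 300, 400, 500],
    'number_of_runs': [10, 22, 29, 41, 50],
    'number_of_runs_batted_in': [12, 20, 31, 40, 52],
    'number_of_errors': [9, 4, 7, 2, 6],
})

target = 'salary_in_thousands_of_dollars'
corr = toy.corr()

# Correlation of each predictor with the target, strongest first.
ranked = corr[target].drop(target).abs().sort_values(ascending=False)

# Correlation between two predictors that both track the target.
runs_vs_rbi = corr.loc['number_of_runs', 'number_of_runs_batted_in']

print(ranked)
print(runs_vs_rbi)
```

In this toy frame the two run-related columns are each strongly correlated with the target and with one another, which is the kind of relationship worth noting before fitting a regression.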

## Review Histogram of the Target

Above we set the "%matplotlib inline" option, which allows us to view plots in the notebook.

In [None]:
subset.salary_in_thousands_of_dollars.hist(bins=50)

## Review KDE of number_of_stolen_bases

In [None]:
subset.number_of_stolen_bases.plot(kind='kde')

## Scatter Plots

In [None]:
subset.plot(y='salary_in_thousands_of_dollars', x='number_of_runs', kind='scatter')

In [None]:
subset.plot(y='salary_in_thousands_of_dollars', x='number_of_runs_batted_in', kind='scatter')

## Scatter Matrix

In [None]:
# scatter_matrix lives in pd.plotting; the last four (indicator) columns are excluded
pd.plotting.scatter_matrix(subset.iloc[:, :-4], alpha=0.2, figsize=(25, 25), diagonal='kde');