# Data Science 

According to [Wikipedia](https://en.wikipedia.org/wiki/Data_science): 

> Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Bringing data to life with graphs and analysis is what makes Jupyter so special. In this lesson you'll get an introduction to `Pandas`, a library for importing, processing and graphing data. 

> Get started with Pandas using the Pandas tutorials:
>
> https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html

## Tables 

Data organized into tables (or *tabular* data) is a convenient and powerful way to represent information about a group of related items. Tables consist of rows that each represent one *entity* and columns that are *attributes* of the entity. 

For example, let's consider the MLB Player dataset from a previous lesson:

```python
import pandas 
df = pandas.read_csv('files/mlb_players.csv', index_col='Name')
df
```

In the example above each row represents one player and the columns are the pieces of information about that player. Columns have a data type, just like variables. You can have as many columns and rows as you like, within the limit of the computer's memory. 
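As a minimal sketch of rows, columns and column data types, here is a tiny hand-made table. The player names and values are made up for illustration; they are not taken from `files/mlb_players.csv`.

```python
import pandas

# Hypothetical stand-in rows; the lesson itself loads files/mlb_players.csv.
players = pandas.DataFrame(
    {
        "Name": ["Player A", "Player B", "Player C"],
        "Position": ["Catcher", "Shortstop", "Relief Pitcher"],
        "Height": [74, 72, 77],            # inches, whole numbers
        "Weight": [180.0, 215.0, 200.0],   # pounds, decimals
    }
).set_index("Name")

n_rows, n_columns = players.shape   # (rows, columns), the index is not counted
column_types = players.dtypes       # one data type per column

print(n_rows, n_columns)
print(column_types)
```

Each column reports its own type, much like each variable in Python has a type.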

## The DataFrame 

The heart of the Pandas library is the DataFrame. The DataFrame represents a table and gives you access to the algorithms you learned in the last lesson without having to write the `for` loops yourself. The algorithms in Pandas are highly optimized, and much of the heavy lifting is done in compiled code (C and Cython), so they typically run much faster than an equivalent `for` loop written in Python. 

![Data frame](images/01_table_dataframe.svg)

But the algorithms you learned don't work on tables; they work on lists. Getting a list, called a `Series`, from a table is simple: just select the **column** you want to turn into a list. 

![Series](images/01_table_series.svg)

For example, let's get the height of every player:

```python
df['Height']
```

Try the example:

In order to stay fast, Pandas only supports certain algorithms, like taking the sum of a column. You can write your own with a `for` loop but that will be much slower.  
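To make the comparison concrete, here is a small sketch that sums the same values two ways. The heights are made-up numbers, not values from the lesson's dataset.

```python
import pandas

heights = pandas.Series([74, 72, 77, 69], name="Height")  # made-up values

# Hand-written reduction, like the loops from the previous lesson.
total_loop = 0
for h in heights:
    total_loop += h

# Built-in reduction: same answer, no explicit loop.
total_builtin = heights.sum()

print(total_loop, total_builtin)
```

Both approaches give the same total. On a handful of rows you won't notice a speed difference, but on large tables the built-in version is usually far faster.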

## Mapping 

In Pandas the mapping operations work on a `Series` and create a new `Series`. Most of the time you want to add the new series to the original `DataFrame` so that you have a *derived column*. Derived columns can make your data easier to work with.

![New column](images/05_newcolumn_1.svg)

The `DataFrame` and `Series` support all of Python's basic operators. That makes it easy to do mapping of one or more columns. For example to translate the weight of each player from pounds to kilograms:

```python
df['Weight'] / 2.205
```

Try it:

If you want to create a new column for the mapped series, it's easy: 

```python
df['Weight (kg)'] = df['Weight'] / 2.205
df['Height (m)'] = df['Height'] / 39.37
df
```

Now the `DataFrame` has additional columns for metric height and weight.

Mapping functions can use more than one column in the calculation! For example, you can calculate the Body Mass Index (BMI) of each of the players. The BMI is defined to be:

$$BMI = kg / m^2$$

To calculate the BMI:

```python 
df['BMI'] = df['Weight (kg)'] / (df['Height (m)'] ** 2)
df
```
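Here is a worked example of the same arithmetic on a single made-up player, chosen so that the metric values come out round. The name `example` and the numbers are assumptions for illustration only.

```python
import pandas

# One hypothetical player: 220.5 pounds and 78.74 inches tall.
example = pandas.DataFrame(
    {"Weight": [220.5], "Height": [78.74]},
    index=["Player A"],
)

example["Weight (kg)"] = example["Weight"] / 2.205   # about 100 kg
example["Height (m)"] = example["Height"] / 39.37    # about 2 m
# BMI = kg / m^2, so about 100 / 2**2 = 25
example["BMI"] = example["Weight (kg)"] / (example["Height (m)"] ** 2)

bmi = example.loc["Player A", "BMI"]
print(round(bmi, 2))
```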

Try adding the BMI column:

Only basic math and logic functions work like you would expect when you perform a mapping function. If you want a mapping that's more complicated, for example one that uses an f-string, you should create a function that takes a column or columns as arguments and returns the result.  

For example:

```python 
def feet_inches(inches):
    """Return a string in feet and inches from inches."""
    return f'''{inches // 12}'{inches % 12}"'''

df['Height'].apply(feet_inches)
```
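As a small sketch of what `apply` gives back, here is the same function run on a few made-up heights. The function is called once per value and the results are collected into a new `Series`.

```python
import pandas

def feet_inches(inches):
    """Return a string in feet and inches from inches."""
    return f'''{inches // 12}'{inches % 12}"'''

heights = pandas.Series([74, 72, 81], name="Height")  # made-up values, in inches

formatted = heights.apply(feet_inches)  # calls feet_inches once per value
print(formatted.tolist())
```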

## Filtering 

Filtering reduces the rows in a `DataFrame` or `Series` to just the ones of interest. 

![Filtering rows](images/03_subset_rows.svg)

Filtering works on either a `Series` or a `DataFrame`. The algorithm works just like the filters you implemented: each row is kept or dropped based on a test. For example, to show only the players taller than 80 inches:

```python 
df[df['Height'] > 80]
```

Try the example:

**Look strange?** The Pandas library takes advantage of advanced features in Python. That sometimes makes it hard to read. In the example above the square brackets are the *index* operator. Inside of them there's a filtering expression that creates a new series of `True` or `False`. See for yourself: 

```python 
df['Height'] > 80
```

Try it:

**Awesome!**
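The two steps, building the `True`/`False` series and then indexing with it, can be sketched separately on a tiny made-up table. The names `mask` and `tall` are just illustrative.

```python
import pandas

players = pandas.DataFrame(
    {"Height": [74, 82, 79, 81]},
    index=["Player A", "Player B", "Player C", "Player D"],  # made-up names
)

mask = players["Height"] > 80   # Series of True/False, one entry per row
tall = players[mask]            # keeps only the rows where mask is True

print(mask.tolist())
print(tall.index.tolist())
```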

## Reduction 

Reduction generates a single value from a `Series`. 

![Aggregation](images/06_aggregate.svg) 

Pandas supports many reduction operations. Here are examples:

| Reduction function | Example | Description | 
| ---- | --- | --- | 
| `sum()` | `df['Height'].sum()` | Return the sum of all values. | 
| `median()` | `df['Height'].median()` | Return the median of values. | 
| `min()` | `df['Height'].min()` | Return the minimum value. | 
| `max()` | `df['Height'].max()` | Return the maximum value. | 
| `len()` | `len(df['Height'])` | Return the number of values in the series. | 
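A quick sketch of the reductions in the table, run on a short made-up `Series` so the results are easy to check by hand:

```python
import pandas

heights = pandas.Series([70, 72, 74, 76, 80], name="Height")  # made-up values

total = heights.sum()       # 70 + 72 + 74 + 76 + 80
middle = heights.median()   # middle value of the sorted data
shortest = heights.min()
tallest = heights.max()
count = len(heights)

print(total, middle, shortest, tallest, count)
```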

Try the reduction functions in the next cell:


If you're taking or are going to take statistics here's a great function for you. The `describe` function computes summary statistics on all numerical columns:

```python
df.describe()
```

Try it:

## Plotting 

A picture is worth 1,000 words! Pandas makes it easy to plot the data in a `Series` or multiple series in a `DataFrame`. There are also many kinds of plots available (too many to cover here). 

![Plotting](images/04_plot_overview.svg) 

The data we have doesn't really have an X-axis so we'll start with a histogram. A histogram shows us how many people fall into "bins", each defined by a range of values of a certain attribute. Histograms of measures like BMI usually result in a *bell curve*. 

To make a histogram:

```python 
df['BMI'].plot.hist()
```

Is the plot a bell curve?
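To see what "bins" means, the counts behind a histogram can be computed directly. This is only a sketch with made-up BMI values and hand-picked bin edges; `pandas.cut` is used here for illustration and is not needed to draw the plot.

```python
import pandas

bmi = pandas.Series([22.0, 24.5, 25.0, 25.5, 26.0, 27.5, 30.0])  # made-up values

# Assign each value to one of three ranges: (20, 24], (24, 28], (28, 32]
binned = pandas.cut(bmi, bins=[20, 24, 28, 32])

# Count how many values landed in each range, keeping the ranges in order.
counts = binned.value_counts(sort=False)
print(counts)
```

Most of the values land in the middle bin and few in the outer bins, which is the shape a bell curve produces.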

## Complex Filters 

Sometimes you need to filter based on complex criteria. This is where Pandas can get a little weird. It's important to remember that the filtering algorithm is done by creating a `Series` that contains only `True` and `False` values.

The MLB player data set has `Starting Pitcher` and `Relief Pitcher` as positions. If you want to show all pitchers you might be tempted to use the `endswith()` function, but you can't because `endswith` is a function of `str`, not a function of `Series`.

```python 
# ERROR: endswith() isn't a function of a Series
pitchers = df['Position'].endswith("Pitcher")
```
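For reference, Pandas does expose string methods on a `Series` through the `.str` accessor, so a vectorized version of `endswith` is available that way. A minimal sketch with made-up positions:

```python
import pandas

positions = pandas.Series(
    ["Starting Pitcher", "Catcher", "Relief Pitcher"],  # made-up values
    name="Position",
)

# .str applies the string method to every value and returns a Series of True/False.
pitchers = positions.str.endswith("Pitcher")
print(pitchers.tolist())
```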

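The `in` operator doesn't work either. On a `Series` it checks the index labels (here, the player names), not the values, so the result is a single `False`:

```python
# WRONG: `in` tests the index labels of a Series, not its values
pitchers = "Pitcher" in df["Position"]
```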
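A small sketch of what `in` actually tests on a `Series`, using made-up data: it looks at the index labels rather than the values, and it always produces one `True` or `False` instead of one per row.

```python
import pandas

positions = pandas.Series(
    ["Starting Pitcher", "Catcher"],
    index=["Player A", "Player B"],  # made-up names used as index labels
)

label_found = "Player A" in positions          # True: "Player A" is an index label
value_found = "Catcher" in positions           # False: values are not checked
in_list = "Catcher" in positions.tolist()      # True: a plain list checks its values

print(label_found, value_found, in_list)
```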

We can try a logical or, but Python's `or` operator raises a `ValueError` on a `Series` because the truth value of a whole `Series` is ambiguous:

```python
# ERROR: The or operator doesn't work on a Series:
pitchers = (df["Position"] == "Starting Pitcher") or (df["Position"] == "Relief Pitcher") 
```
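The failure can be observed safely by catching the exception. This sketch uses made-up positions; the variable `message` is only there to hold the error text.

```python
import pandas

positions = pandas.Series(["Starting Pitcher", "Catcher", "Relief Pitcher"])  # made-up values

message = None
try:
    # `or` asks Python for a single True/False from the whole left-hand Series.
    pitchers = (positions == "Starting Pitcher") or (positions == "Relief Pitcher")
except ValueError as err:
    message = str(err)

print(message)
```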

In order to do a logical `or` we have to use Python's bitwise or operator, `|`:

```python
pitchers = (df["Position"] == "Starting Pitcher") | (df["Position"] == "Relief Pitcher")
df[pitchers]
```

**The parentheses are mandatory because `==` has a lower precedence than `|`.** It's pretty complex syntax to remember, but there's a better way.

The simplest way to do a complex filter is to use the `apply()` function. Since `apply()` works on a series and produces a series, it's possible to write a function that returns `True` or `False` to create a series used for filtering. The `is_pitcher` function takes a cell value as an argument. The `Position` column holds `str` values, so `apply()` passes each one to the function as a `str`:

```python
def is_pitcher(position):
    return "Pitcher" in position

pitchers = df["Position"].apply(is_pitcher)
df[pitchers]
```

If we want to see *non-pitchers* we can reverse the `pitchers` Series using the bitwise `not` operator, `~`. Here are all the non-pitchers:

```python
df[~pitchers]
```
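As a sketch of how `~` relates to the original series, the example below builds both on made-up positions. Every row ends up in exactly one of the two groups.

```python
import pandas

def is_pitcher(position):
    return "Pitcher" in position

positions = pandas.Series(
    ["Starting Pitcher", "Catcher", "Relief Pitcher", "Shortstop"]  # made-up values
)

pitchers = positions.apply(is_pitcher)  # True where the position is a pitcher
non_pitchers = ~pitchers                # flips every True/False

print(pitchers.tolist())
print(non_pitchers.tolist())
```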


The `apply()` function also works on a `DataFrame`. The `axis=1` argument says to run `tall_pitcher` once for each row, rather than the default of once for each column. Using `apply` on a `DataFrame` enables filtering on multiple columns:

```python
def tall_pitcher(row):
    return "Pitcher" in row['Position'] and row['Height'] > 78

df[df.apply(tall_pitcher, axis=1)]
```
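For comparison, the same filter can be written with column operations and the bitwise and operator, `&`. This is a sketch on made-up rows; it assumes the `.str.contains` string method as a stand-in for the `in` test inside `tall_pitcher`.

```python
import pandas

players = pandas.DataFrame(
    {
        "Position": ["Starting Pitcher", "Relief Pitcher", "Catcher", "Relief Pitcher"],
        "Height": [80, 75, 79, 79],
    },
    index=["Player A", "Player B", "Player C", "Player D"],  # made-up names
)

def tall_pitcher(row):
    return "Pitcher" in row['Position'] and row['Height'] > 78

# Row-by-row version: tall_pitcher is called once per row.
by_apply = players.apply(tall_pitcher, axis=1)

# Column-wise version: two True/False series combined with &.
by_mask = players["Position"].str.contains("Pitcher") & (players["Height"] > 78)

print(by_apply.tolist())
print(by_mask.tolist())
```

Both versions select the same rows. The `apply` version is often easier to read, while the column-wise version avoids calling a Python function for every row.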

Try it:


One of the stats doesn't make a bell curve! What happens if we plot pitchers separately from other players?

```python
display(df[pitchers]['Height'].plot.hist())
display(df[~pitchers]['Height'].plot.hist())
```

Pandas is huge and rich, and there's enough to learn to fill a whole 16-week class.