# Pandas: processing data
Before you begin, go over the [demo code of basic statistics and visualization](https://github.com/mgbarsky/cs_1503_basic_stats_demo), presented in class.

**TURNING IT IN:** Submit this Jupyter notebook on Canvas by the deadline. Don't submit any other files.

## So, what is pandas anyway?

`pandas` is a Python library for data analysis.
It is a foundational part of using Python for machine learning. Most, if not all, of what pandas does can be done in plain Python, but most of the time pandas does it *faster* and *easier*. It is built on top of another powerful third-party Python library called `numpy`.

pandas has a set of *structures* and *functions* that simplify working with large datasets. Once you learn them, most questions about a tabular dataset can be answered in a few lines of code.

pandas also interacts nicely with a bunch of other Python libraries and programs:

* Jupyter notebooks, which allow you to construct computational narratives with code, data, and text. Displaying dataframes (one of pandas' data structures) as inline HTML tables is one of the major interactions between Jupyter and pandas.
* [Matplotlib](https://matplotlib.org/) is a powerful graphing library for Python. Generating plots from dataframes is simple with matplotlib and pandas.
* pandas also integrates with scientific computing and machine learning libraries, like [scikit-learn](http://scikit-learn.org/stable/) and [SciPy](https://www.scipy.org/).

__Keep running all the code cells as you read__.

## Sample dataset
We use the original Titanic dataset, which describes the survival status of individual passengers on the Titanic.

The dataset is stored in the file `titanic.csv`. The `.csv` extension means that each record (row) is on a separate line, and the values within a record are comma-separated.
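
To make the CSV layout concrete, here is a minimal sketch that parses a tiny made-up CSV from a string. The three passengers and their values are hypothetical and are not taken from `titanic.csv`. The empty field in the second record shows how a missing value is read.

```python
import io
import pandas as pd

# Hypothetical three-passenger CSV; the values are made up for illustration.
csv_text = """PassengerId,Name,Age
1,Alice,29
2,Bob,
3,Carol,41
"""

# io.StringIO lets read_csv parse a string as if it were a file on disk.
sample = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(sample)
print(sample.shape)  # (rows, columns); the index column is not counted
```

The first line of the text becomes the column names, and `index_col` turns one of those columns into the row labels.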

In [None]:
file_name = "titanic.csv"

In [None]:
import pandas as pd

# this creates a pandas.DataFrame
data = pd.read_csv(file_name, index_col='PassengerId')

In [None]:
# Selecting rows
head = data[:10]

head  # if you leave an expression at the end of a cell, jupyter will "display" it automatically

#### Some dataset attributes
* Name - a string with the person's full name
* Survived - 1 if a person survived the shipwreck, 0 otherwise.
* Pclass - passenger class. Pclass == 3 is cheap'n'cheerful, Pclass == 1 is for moneybags.
* Sex - a person's gender (in those ol' times when just 2 of them were allowed)
* Age - age in years, if available
* SibSp - number of siblings or spouses aboard the ship
* Parch - number of parents or children aboard the ship
* Fare - ticket cost
* Embarked - port where the passenger embarked
     * C = Cherbourg; Q = Queenstown; S = Southampton

## Pandas basics

In [None]:
# table dimensions
print("len(data) = ", len(data))
print("data.shape = ", data.shape)

In [None]:
# select a single row by its index label - the passenger with PassengerId 4
print(data.loc[4])

In [None]:
# select a single column.
ages = data["Age"]  # alternatively: data.Age
print(ages[:10])  # prints the first 10 values of the resulting Series

In [None]:
# select several columns and rows at once (rows by label, columns by name)
# alternatively: data[["Fare", "Pclass"]].loc[5:10]
data.loc[5:10, ["Fare", "Pclass"]]
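
The selections above mix label-based and position-based access, which is easy to confuse when the index starts at 1. The sketch below uses a small hypothetical frame (made-up values, with an index named `PassengerId` to mimic the notebook's data) to show how the three forms differ.

```python
import pandas as pd

# Hypothetical frame whose index labels start at 1, like PassengerId.
df = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.92, 53.10, 8.05], "Pclass": [3, 1, 3, 1, 3]},
    index=pd.Index([1, 2, 3, 4, 5], name="PassengerId"),
)

by_label = df.loc[2:4, ["Fare", "Pclass"]]  # labels 2, 3 and 4 (end label included)
by_position = df.iloc[2:4]                  # positions 2 and 3 (end position excluded)
first_rows = df[:2]                         # plain slice: positions 0 and 1

print(list(by_label.index))
print(list(by_position.index))
print(list(first_rows.index))
```

Note that a `.loc` slice includes its end label, while `.iloc` and plain slices follow the usual Python rule of excluding the end position.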

pandas also has some basic data analysis tools. For one, you can quickly display summary statistics for each numeric column using `.describe()`.

In [None]:
data.describe()
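
The result of `.describe()` is itself a DataFrame, so individual statistics can be picked out with `.loc`. A minimal sketch on a hypothetical four-row frame (the values are made up):

```python
import pandas as pd

# Hypothetical data with one missing Age.
df = pd.DataFrame(
    {"Age": [22.0, 38.0, None, 35.0], "Fare": [7.25, 71.28, 7.92, 53.10]}
)

summary = df.describe()

print(list(summary.index))          # names of the reported statistics
print(summary.loc["count", "Age"])  # count skips the missing value
print(summary.loc["mean", "Fare"])
```

A `count` lower than the number of rows is a quick hint that a column has missing values.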

In [None]:
# filters
print("Only male children")
mc = data[(data['Age'] < 18) & (data['Sex'] == 'male')]
mc.head()
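
The filter above works because each comparison produces a boolean Series with one value per row, and the DataFrame keeps only the rows marked `True`. The sketch below spells this out on a hypothetical four-row frame (made-up values).

```python
import pandas as pd

# Hypothetical passengers; index labels start at 1.
df = pd.DataFrame(
    {"Age": [4.0, 30.0, 16.0, 45.0], "Sex": ["male", "male", "female", "female"]},
    index=[1, 2, 3, 4],
)

is_minor = df["Age"] < 18       # boolean Series, one value per row
is_male = df["Sex"] == "male"

# & is element-wise AND, | is element-wise OR.
# When the comparisons are written inline, wrap each one in parentheses,
# because & and | bind more tightly than < and ==.
both = df[is_minor & is_male]
either = df[is_minor | is_male]

print(is_minor.tolist())
print(list(both.index), list(either.index))
```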

<div style="background-color:yellow;">
    <h3>Task 1. Your turn:</h3>    
</div>



In [None]:
# select passengers number 13 and 666 - did they survive?

#<YOUR CODE>

In [None]:
# compute the overall survival rate (what fraction of passengers survived the shipwreck)

#<YOUR CODE>

## Missing values
Some columns contain __NaN__ values, which means that there is no data there. For example, passenger `#6` has an unknown *Age*, and some others have an unknown *Fare*. To simplify the data analysis, we can replace missing values using the pandas `fillna` method.

_Note: we do this only for the purpose of this tutorial. In general, you should think twice before you modify data like this._

In [None]:
# Age before replacement
data.loc[6]

In [None]:
data['Age'] = data['Age'].fillna(value=data['Age'].mean())


In [None]:
# Age after replacement - meaning?
data.loc[6]
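
A minimal sketch of the same replacement on a hypothetical Series of made-up heights, small enough to check by hand. It also shows that `fillna` returns a new object, which is why the cell above assigns the result back to `data['Age']`.

```python
import pandas as pd

# Hypothetical measurements with two missing entries.
heights = pd.Series([150.0, None, 170.0, None, 190.0], name="height")

print(heights.isna().sum())  # number of missing values before filling

# mean() skips NaN by default, so it averages only the three known values.
filled = heights.fillna(heights.mean())

print(filled.tolist())
print(heights.isna().sum())  # the original Series is unchanged
```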

<div style="background-color:yellow;">
    <h3>Task 2. Your turn:</h3>    
</div>


In [None]:
# Replace the missing values of the "Fare" column with the median of the fare values:
#<YOUR CODE>


## Basic statistics

We can compute all the basic statistics on the dataset. For example, we can compute a max of the Fare column like this:

In [None]:
# compute the maximum ticket price
m = data["Fare"].max()
m

Let's locate the passenger who paid this max price:

In [None]:
data.loc[data['Fare'].idxmax()]

`idxmax()` returns the index *label* of the first row with the maximum Fare. Because the index of `data` is `PassengerId`, that label must be looked up with the label-based `.loc` indexer. The position-based `.iloc` indexer counts rows from 0, so it would return a different passenger. If several passengers paid the same maximum price, `data[data['Fare'] == m]` returns all of them.
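
Here is a small sketch of how `idxmax()` interacts with label-based and position-based lookup. The frame is hypothetical (made-up names and fares), with index labels starting at 1 and a deliberate tie for the maximum.

```python
import pandas as pd

# Hypothetical passengers; labels start at 1, and two fares tie for the maximum.
df = pd.DataFrame(
    {"Name": ["Ann", "Ben", "Cid", "Dee"], "Fare": [10.0, 80.0, 25.0, 80.0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

label = df["Fare"].idxmax()  # index label of the FIRST row holding the maximum

print(label)
print(df.loc[label, "Name"])   # lookup by label: the passenger who paid the most
print(df.iloc[label]["Name"])  # lookup by position: a different row

# A boolean filter returns every row that ties for the maximum.
top = df[df["Fare"] == df["Fare"].max()]
print(list(top.index))
```

With a 1-based index, passing the label to `.iloc` silently selects the next row, so the result looks plausible but is wrong.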

<div style="background-color:yellow;">
    <h3>Task 3. Your turn:</h3>    
</div>


In [None]:
# your code: compute mean passenger age and the data about the oldest woman on the ship
# <YOUR CODE>

In [None]:
# who on average paid more for their ticket, men or women?

# mean_fare_men = <YOUR CODE>
# mean_fare_women = <YOUR CODE>

# print(mean_fare_men, mean_fare_women)

In [None]:
# who is more likely to survive: a child (<18 yo) or an adult?

# child_survival_rate = <YOUR CODE>
# adult_survival_rate = <YOUR CODE>

# print(child_survival_rate, adult_survival_rate)

More about Pandas:
[kaggle microlesson](https://www.kaggle.com/learn/pandas)

## Plots and matplotlib

Visualizing data in Python is covered by yet another library: `matplotlib`.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# ^-- this "magic" tells all future matplotlib plots to be drawn inside notebook and not in a separate window.

# line plot
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

In [None]:
# scatter-plot
plt.scatter([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25])

plt.show()  # show the first plot and begin drawing next one

In [None]:
# draw a scatter plot with custom markers and colors
plt.scatter([1, 1, 2, 3, 4, 4.5], [3, 2, 2, 5, 15, 24],
            c=["red", "blue", "orange", "green", "cyan", "gray"], marker="x")

# without .show(), several plots will be drawn on top of one another
plt.plot([0, 1, 2, 3, 4, 5], [0, 1, 4, 9, 16, 25], c="black")

# adding more sugar
plt.title("Conspiracy theory proven!!!")
plt.xlabel("Per capita alcohol consumption")
plt.ylabel("# of data scientists per 100,000")

# fun with correlations: http://bit.ly/1FcNnWF

In [None]:
# histogram - showing data density
plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10])
plt.show()

plt.hist([0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4,
          4, 5, 5, 5, 6, 7, 7, 8, 9, 10], bins=5)
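
To see what the `bins` argument does, it can help to look at the numbers behind the picture. `plt.hist` returns the bin counts and bin edges along with the drawn bars. The sketch below reuses the list from the cell above. The `matplotlib.use("Agg")` line is an assumption that the code runs as a plain script with no display, so drop it inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this line inside Jupyter
import matplotlib.pyplot as plt

values = [0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4,
          4, 5, 5, 5, 6, 7, 7, 8, 9, 10]

# bins=5 splits the range from min(values) to max(values) into 5 equal-width intervals.
counts, edges, _ = plt.hist(values, bins=5)

print(counts)  # how many values fall into each bin
print(edges)   # 6 edges delimit 5 bins
plt.close()    # discard the figure so later plots start clean
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.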

<div style="background-color:yellow;">
    <h3>Task 4. Your turn:</h3>    
</div>


In [None]:
# plot a histogram of age and a histogram of ticket fares on separate plots

# <YOUR CODE>


In [None]:
# Can you do that? find out if there is a way to draw a 2D histogram of age vs fare.
# <YOUR CODE>

In [None]:
# make a scatter plot of passenger age vs ticket fare

# <YOUR CODE>



In [None]:
# Can you do that? add separate colors for men and women
# <YOUR CODE>

More about Data visualization:
[kaggle microlesson](https://www.kaggle.com/learn/data-visualization)

### This is the end of the pandas-primer recitation. 

Copyright &copy; 2020 Marina Barsky.