# Pandas introduction

Today we are going to start exploring datasets. For this we are going to use the `pandas` library.

In this section we will cover:

* [<mark>Exercise: Lambdas Starter</mark>](#lambda-starter)
* [Databases and Python](#databases)
* [Pandas overview](#overview)
* [Benefits of using pandas](#benefits)
* [Data wrangling](#wrangling)
* [Exploring a dataset](#exploring)
* [<mark>Exercise: Explore a dataset</mark>](#ex-explore-data)
* [Analysis](#analysis)

<a id = 'lambda-starter'></a>
## <mark>Exercise: Lambdas Starter</mark>

Today we will start with an exercise on **lambda functions**. They are worth revisiting because they are very useful with Pandas.

1. Create a lambda function that adds 4 to a number (and check it)

2. Create a lambda function that multiplies two numbers together (and check it)

3. Create a lambda function to check if a number is bigger than 10 (and check it)

<a id = 'databases'></a>
## Databases and Python

Working with tabular data in plain Python can be a bit messy. Reading a CSV file with `open` gives us one long string rather than rows and columns:

In [None]:
with open('data/chickweight.csv', 'r') as file:
    print(type(file))   # a text file object, not a table
    data = file.read()  # the whole file as one long string

print(data[:100])
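
To make the messiness concrete, here is a minimal sketch of computing an average weight with only the standard library. The inline sample rows and their column names are made up for illustration (they are not read from `data/chickweight.csv`), and `io.StringIO` stands in for the open file.

```python
import csv
import io

# Hypothetical sample standing in for the real file; the values are invented.
raw = io.StringIO(
    "weight,Time,Chick,Diet\n"
    "42,0,1,1\n"
    "51,2,1,1\n"
    "59,4,1,1\n"
)

# csv.DictReader yields one dict per row, and every value is a string.
rows = list(csv.DictReader(raw))

# Types have to be converted and values aggregated by hand.
weights = [float(row["weight"]) for row in rows]
mean_weight = sum(weights) / len(weights)

print(rows[0])
print(mean_weight)
```

Every value arrives as a string, so type conversion, missing values and grouping all have to be handled manually.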

This is why we need Pandas!

<a id = 'overview'></a>
## Pandas overview

Pandas is a specialised package for working with tabular data, such as CSV files and database tables, in Python.

First we need to import the package:

In [None]:
import pandas as pd

Then, to read in a CSV file we can use:
```python
pd.read_csv('filepath/file.csv')
```

We are going to read in the `chickweight.csv` file, which is in our `data/` folder:

In [None]:
chickweight = pd.read_csv('data/chickweight.csv')

In [None]:
chickweight

Aside from making tables look prettier in your Jupyter notebook, there are many advantages to using the Pandas package when working with data.

<a id = 'benefits'></a>

## The benefits of using Pandas (and Python)

- ***Groupby - Split-apply-combine*** - The ability to split your dataset into groups, apply a function to each group, and combine the results is the number one reason to use Pandas.
- ***Cleaning*** - Pandas allows you to automate the cleaning of your datasets.
- ***Merging dataframes*** - Gives you SQL-style joins from within Python.
- ***Time series*** - Pandas handles time series operations well, such as converting between periods and resampling.
- ***Speed*** - When working with large datasets, it is usually much faster than spreadsheet tools like Excel.
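
As a rough illustration of the first and third points, the sketch below runs a split-apply-combine and a SQL-style join on a tiny made-up dataframe. The values, the `labels` table and its `label` column are invented for this example and do not come from the chicken dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "diet": [1, 1, 2, 2],
    "weight": [40.0, 50.0, 60.0, 80.0],
})

# Split the rows by diet, apply the mean to each group,
# and combine the results into one Series indexed by diet.
mean_by_diet = toy.groupby("diet")["weight"].mean()
print(mean_by_diet)

# Merge in a lookup table, much like a SQL left join on the `diet` key.
labels = pd.DataFrame({"diet": [1, 2], "label": ["control", "high protein"]})
merged = toy.merge(labels, on="diet", how="left")
print(merged)
```
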

<a id = 'wrangling'></a>

## Data wrangling with Pandas

In this set of notebooks, we will explain how to use pandas for data wrangling.

***Data wrangling*** is the process of transforming and mapping data to make it more suitable and valuable for downstream purposes such as analytics.

Ultimately, though, Pandas is just a tool, so in this notebook we will also practise a little data analysis.

To demonstrate Pandas' capabilities, we'll use a dataset about chickens.

In [None]:
chickweight = (
    pd.read_csv('data/chickweight.csv')
    .rename(str.lower, axis='columns')
)
chickweight.head()
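
The `.rename(str.lower, axis='columns')` step passes every column name through `str.lower`. Here is a minimal sketch on a made-up single-row frame; the mixed-case column names below are assumed for illustration only.

```python
import pandas as pd

# Hypothetical frame with mixed-case column names; the values are invented.
toy = pd.DataFrame({"weight": [42], "Time": [0], "Chick": [1], "Diet": [1]})

# rename accepts a function and applies it to every label on the given axis.
# It returns a new dataframe and leaves the original untouched.
lowered = toy.rename(str.lower, axis="columns")

print(list(toy.columns))
print(list(lowered.columns))
```

Consistent lowercase names mean we do not have to remember which columns were capitalised when selecting them later.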

<a id = 'exploring'></a>
## Exploring the dataset

In this section we'll highlight some tools for getting an initial understanding of the dataset.

Exploring a dataset includes:

* Checking the shape (`df.shape`) of the dataframe
* Checking the length (`len(df)`) of the dataframe
* Getting general information (`df.info()`) about the dataframe and its columns
* Getting summary statistics, such as the count, mean, min and max, of each numeric column (`df.describe()`)
* Listing the column names (`df.columns`)
* Fetching the first or last few rows, or a random sample (`df.head()`, `df.tail()`, `df.sample()`)
* Selecting one (or more) columns (`df['column_name']`)
* Fetching the unique values of a column (`df['column_name'].unique()`)
* Counting how often each unique value occurs in a column (`df['column_name'].value_counts()`)

In [None]:
chickweight.shape

In [None]:
len(chickweight)

In [None]:
chickweight.info()

In [None]:
chickweight.describe()

In [None]:
chickweight.columns

In [None]:
chickweight.head()

In [None]:
chickweight.tail()

In [None]:
chickweight.sample(5)

In [None]:
chickweight.loc[[45]]
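
The double brackets in the cell above matter: passing a list of labels to `.loc` returns a dataframe, while a single label returns a Series. A small sketch on made-up data (not the chicken dataset) shows the difference.

```python
import pandas as pd

# Hypothetical toy data with the default integer index (0, 1, 2).
toy = pd.DataFrame({"weight": [42, 51, 59], "diet": [1, 1, 2]})

row_as_frame = toy.loc[[1]]   # list of labels -> DataFrame with one row
row_as_series = toy.loc[1]    # single label   -> Series

print(type(row_as_frame).__name__)
print(type(row_as_series).__name__)
```
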

In [None]:
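# Anti-pattern: attribute access fails for column names that contain spaces
# or clash with an existing DataFrame attribute or method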

chickweight.diet

In [None]:
# Better: bracket selection works for any column name
chickweight['diet']
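
To see why attribute access is fragile, consider a made-up frame whose column names are chosen to cause trouble. The names `count` and `body weight` are hypothetical and do not appear in the chicken dataset.

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a DataFrame method,
# the other contains a space.
toy = pd.DataFrame({"count": [3, 5], "body weight": [42, 51]})

# Attribute access finds the DataFrame.count method, not the column.
print(callable(toy.count))

# Bracket access always returns the column.
print(toy["count"].tolist())
print(toy["body weight"].tolist())
```

A name like `body weight` cannot be written as an attribute at all, so bracket selection is the only form that works for every column.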

In [None]:
chickweight['diet'].unique()

In [None]:
chickweight['diet'].value_counts()

<a id = 'ex-explore-data'></a>
## <mark>Exercise: Explore a dataset</mark>

Investigate the `weight` and `time` columns of the dataframe.

Here are some questions you could answer:

1. How many different unique values for `time` are there? What do you think time represents in this dataframe?

2. What are the min and max of the `time` and `weight` columns?

3. What is the most common weight of a chicken?

<a id = 'analysis'></a>
## Analysis

### What analysis could we do? 

<img src="images/chick.png" width="240" height="240" align="center"/>

Imagine that we are a farm and we have this dataset available.

Now that we have a feel for the dataset, what would we do with it?

In [None]:
chickweight.head()

### Potential areas for analysis:

Some questions we might want to answer are:

The main use case we will focus on is figuring out which diet is best, but it is good to think about some of the other use cases.