# Data Understanding and Visualization
#### with Pandas, NumPy, and Matplotlib

- Pandas
    - Powerful data manipulation tool. Built for speed and efficiency.
    - Lets you handle data with columns and rows.
- NumPy
    - Extremely fast mathematical library. Most scientific computing Python packages are built on top of NumPy.
    - Super fast N-Dimensional Arrays
- Matplotlib
    - Graphical visualizations of array data, whether in Pandas, NumPy, or list format.
    - The third-party package seaborn offers a more appealing API and color schemes.

In this notebook, we'll install all of the packages at once, then run through NumPy, Pandas, and Matplotlib sections. You'll need to fill in some code, marked by #TODO.

# Installation

Run the following cell to install all four packages at once. If this does not work on your system, you can manually use the command line to install these packages.

In [None]:
!pip install numpy pandas matplotlib seaborn

# NumPy

In this section, we'll work with NumPy arrays and learn how to use this data structure.

In [None]:
import numpy as np

## N-Dimensional Arrays
Run the two cells below. Observe how the first cell is a plain Python list. The output of the second cell is a np.array with the same values. There is no "shape" to the plain Python list. However, we can call .shape on any np.array to get the shape of that array. (Remember, arrays are N-Dimensional!)

In [None]:
[1,2,3]

In [None]:
arr1 = np.array([1,2,3])
arr1

In [None]:
arr1.shape

Notice how a 1-dimensional array's shape has only one length followed by a trailing comma. This tells us that the first (and only) dimension's length is 3. Why is it not (3,1)? A shape of (3,1) would describe a *2*-dimensional array whose first dimension has length 3 and whose second has length 1. This may seem like the same kind of array, but it is distinctly different -- especially when we get to Pandas!
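To make the difference concrete, here is a small sketch comparing the two shapes. The names `flat` and `column` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

flat = np.array([1, 2, 3])        # 1-dimensional: shape (3,)
column = flat.reshape(3, 1)       # 2-dimensional: 3 rows, 1 column

print(flat.shape, flat.ndim)      # (3,) 1
print(column.shape, column.ndim)  # (3, 1) 2

# Indexing differs too: flat[0] is a single number,
# while column[0] is itself a 1-element array.
print(flat[0])
print(column[0])
```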

In [None]:
arr2 = np.array([
    [1,2,3],
    [4,5,6]
])
arr2

In [None]:
#TODO: Write a line of code below to output the shape of the arr2 array. Expected output is (2,3)


In [None]:
#TODO: Write some code to create an np.array that has the shape (3,3).


## Utility properties
Using arr1 and arr2 from the previous cells, let's explore some utility properties that numpy offers.

In [None]:
# Finding the number of dimensions
arr1.ndim

In [None]:
arr2.ndim

Notice that .ndim is the same as the length of .shape:

In [None]:
len(arr1.shape)

In [None]:
len(arr2.shape)

In [None]:
# Finding what kind of number is stored
arr1.dtype

An array has a single dtype, which means we can't keep floats and ints side by side in the same array. If we try to do that, our ints will be converted into floats.

In [None]:
np.array([1.4, 1])
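As a quick check of that conversion, the sketch below compares the dtype of an all-integer array with a mixed one. The variable names `ints` and `mixed` are only for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1.4, 1])

print(ints.dtype)    # an integer dtype (the exact width depends on your platform)
print(mixed.dtype)   # float64
print(mixed)         # [1.4 1. ]  <- the 1 is now stored as the float 1.0
```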

# Pandas

In [None]:
import pandas as pd

Most of Pandas' strength comes from DataFrames, which are like 2-dimensional arrays with labeled rows and columns. There are also Pandas Series, which are like 1-dimensional arrays. Both are convenient for non-numeric data such as strings, and each column of a DataFrame can have its own dtype.

In [None]:
animals = pd.Series(['dog','cat','frog','bee'])
animals

In [None]:
colors = pd.Series(['brown', 'black', 'green', 'yellow'])
colors

In [None]:
animal_df = pd.DataFrame({
    "animal": animals,
    "color": colors
})
animal_df

In [None]:
#TODO: Create your own dataframe! It can be anything, as long as it has at least 2 columns and 3 rows.


## Working with data!
In this section, we'll be using the two CSV files from Canvas:
- honeyproduction.csv
- honeyproduction_withnulls.csv

Download the files from Canvas and place them in the same folder as this notebook.

In [None]:
honey_df = pd.read_csv("honeyproduction.csv")

Now let's take a look at the first five rows. We can use .head() to display the "head" of the DataFrame, which is the first five rows by default.

In [None]:
honey_df.head()

### DataFrame utility properties!

In [None]:
# get the data type for each column.
honey_df.dtypes

Notice the similarity in how these are presented compared to NumPy dtypes!

In [None]:
# get a list-like object (Index) of the columns. This is an iterable object.
honey_df.columns

Now, try to get the shape of the DataFrame. Think of how shapes work in NumPy.

In [None]:
#TODO get shape of honey_df. Expected output: (626, 8)


We can also use `.describe()` to see some statistical information on the numerical columns.

In [None]:
honey_df.describe()

### Slicing DataFrames
Let's practice slicing the DataFrame to precisely access certain parts of the data.

In [None]:
# get the column with state information
honey_df['state']

In [None]:
# get the columns with state, priceperlb, and year information
honey_df[['state', 'priceperlb', 'year']]

**CHECKPOINT!**
Notice how the syntax is different for accessing one column compared to multiple. Also note that when accessing one column, we get a Series object. Contrast this with multiple columns, where we get a DataFrame that is a subset of `honey_df`.
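Here is a minimal sketch of that difference on a made-up DataFrame. `toy_df` and its values are hypothetical stand-ins, not the real honey data.

```python
import pandas as pd

# Hypothetical toy data, standing in for honey_df
toy_df = pd.DataFrame({
    "state": ["NC", "VA", "SC"],
    "year": [1998, 1998, 1999],
    "priceperlb": [0.72, 0.64, 0.70],
})

one_col = toy_df["state"]            # single brackets, one column name
sub_df = toy_df[["state"]]           # double brackets, a list of column names

print(type(one_col).__name__)        # Series
print(type(sub_df).__name__)         # DataFrame
print(one_col.shape, sub_df.shape)   # (3,) (3, 1)
```

Note that the shapes echo the NumPy section: the Series is 1-dimensional, while the one-column DataFrame is 2-dimensional.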

But what about rows?

In [None]:
# let's take a look at our DataFrame for reference.
honey_df.head()

In [None]:
# if we want to LOCate the row with the label 0:
honey_df.loc[0]

In [None]:
# if we want to locate the row based on position (not necessarily the labeled index)
honey_df.iloc[2]

In [None]:
# we can do ranges as well. they work like normal python slicing syntax
honey_df.iloc[1:4]

In [None]:
# just to prove the point, we can even get the first few even rows:
honey_df.iloc[0:11:2]
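With the default index, labels and positions happen to match, so `.loc` and `.iloc` can look interchangeable. The sketch below uses a made-up DataFrame (`toy_df`, with hypothetical labels 10, 20, 30) to show where they differ.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"state": ["NC", "VA", "SC"]},
    index=[10, 20, 30],   # row labels that differ from row positions
)

print(toy_df.loc[10, "state"])   # label 10    -> NC
print(toy_df.iloc[0]["state"])   # position 0  -> NC
print(toy_df.iloc[2]["state"])   # position 2  -> SC
# toy_df.loc[0] would raise a KeyError here: no row is labeled 0
```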

### Filtering DataFrames with conditionals

In [None]:
# filter the honey_df DataFrame so that we have all rows from NC
honey_df[honey_df['state'] == 'NC']

This syntax is confusing, but let's break it down:

Here's the column of `honey_df` that has state data

In [None]:
honey_df['state']

We can use a conditional to return a series of boolean values, which indicate whether that conditional was evaluated as true or false for a given row.

In [None]:
honey_df['state'] == 'NC'

Bringing it together, we get the rows of honey_df where our conditional was true!

In [None]:
honey_df[honey_df['state'] == 'NC']
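The same steps can be sketched on a small made-up DataFrame, where the boolean Series is easy to read in full. `toy_df` and its values are hypothetical. The last lines go one step past the text above and combine two conditions, which is an extension you may find useful.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "state": ["NC", "VA", "NC", "SC"],
    "priceperlb": [0.72, 0.64, 1.10, 0.70],
})

mask = toy_df["state"] == "NC"   # a boolean Series, one value per row
print(mask.tolist())             # [True, False, True, False]
nc_rows = toy_df[mask]           # keep only the rows where mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses
nc_expensive = toy_df[(toy_df["state"] == "NC") & (toy_df["priceperlb"] > 1.0)]
print(len(nc_rows), len(nc_expensive))   # 2 1
```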

In [None]:
#TODO write your own conditional to filter a subset of the DataFrame.
#It can be anything EXCEPT filtering by state. Hint: Try year!


## Working with missing data/nulls!
Now, real world data is never 'clean'. We should practice on some data that has null values. Let's re-assign `honey_df` to a DataFrame with data from `honeyproduction_withnulls.csv`

In [None]:
honey_df = pd.read_csv('honeyproduction_withnulls.csv')

In [None]:
honey_df.head()

We can see some nulls: specifically under totalprod and priceperlb. These are marked as NaN ("not a number"), as that's how NumPy handles non-existent data. Let's see how many nulls we have:

In [None]:
honey_df.isna().sum()

This returns the sum of the true values from `isna()`, which gives us a count of all nulls in our DataFrame, by column. For `totalprod`, let's just fill the null values with the mean. This can work since there are very few nulls.

In [None]:
honey_df['totalprod'] = honey_df['totalprod'].fillna(
    honey_df['totalprod'].mean()
)

In [None]:
honey_df.isna().sum()
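To see what mean-filling does on data small enough to check by hand, here is a sketch with a made-up three-value Series (`toy` is hypothetical). Note that `.mean()` skips NaN values by default.

```python
import numpy as np
import pandas as pd

toy = pd.Series([10.0, np.nan, 30.0])
print(toy.isna().sum())    # 1
print(toy.mean())          # 20.0, the NaN is skipped

filled = toy.fillna(toy.mean())
print(filled.tolist())     # [10.0, 20.0, 30.0]
```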

In [None]:
#TODO do the same for priceperlb


Now we're going to just drop the rows that have null values. If you're certain that you only have a few null datapoints and that the presence of null values has no impact on your dataset, this can be a good approach. Let's see the length of `honey_df` before and after as well.

In [None]:
len(honey_df)

In [None]:
honey_df = honey_df.dropna(axis=0)

In [None]:
len(honey_df)

Only dropping 11 rows is great for this dataset!
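The sketch below shows the same before-and-after check on a made-up DataFrame (`toy_df` is hypothetical). It also shows the `subset` parameter of `dropna`, which is not used in this notebook but limits the null check to the columns you name.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "state": ["NC", "VA", "SC"],
    "stocks": [100.0, np.nan, 300.0],
    "year": [1998, 1998, np.nan],
})

print(len(toy_df))                  # 3
cleaned = toy_df.dropna(axis=0)     # drop rows with a null in any column
print(len(cleaned))                 # 1

# subset= only looks for nulls in the listed columns
by_stocks = toy_df.dropna(subset=["stocks"])
print(len(by_stocks))               # 2
```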

## Adding new columns

Let's say we need to add the price per ounce as well. Here's how we can add another column to our DataFrame.

In [None]:
honey_df['priceperoz'] = honey_df['priceperlb'] / 16

In [None]:
honey_df.head()

## Sampling data

If we need a random sample of our data, say 23% of the rows, we can do this easily.

In [None]:
sampled_df = honey_df.sample(frac=0.23)

In [None]:
len(sampled_df)

In [None]:
len(honey_df)
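Because `.sample()` is random, you get different rows each time you run it. A minimal sketch, using a made-up 100-row DataFrame (`toy_df` is hypothetical), shows how the `random_state` parameter makes the sample repeatable.

```python
import pandas as pd

toy_df = pd.DataFrame({"value": range(100)})

# The same random_state gives the same rows every time
sample_a = toy_df.sample(frac=0.23, random_state=0)
sample_b = toy_df.sample(frac=0.23, random_state=0)

print(len(sample_a))               # 23
print(sample_a.equals(sample_b))   # True
```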

## Applying functions to a column

Let's say we need the price per gram. We could do this using a similar syntax to when we created priceperoz, but for the sake of exercise let's use `.apply()`

In [None]:
honey_df['priceperoz'].apply(lambda p: p / 28.35)  # 1 oz is about 28.35 g, so divide

To add this to our DataFrame, we can just assign this Series to a new column.

In [None]:
honey_df['pricepergram'] = honey_df['priceperoz'].apply(lambda p: p / 28.35)

In [None]:
honey_df.head()
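For simple arithmetic, `.apply()` and the column-wise syntax give the same result. The sketch below checks this on a made-up Series (`price_per_lb` is hypothetical), reusing the pounds-to-ounces conversion from earlier.

```python
import pandas as pd

price_per_lb = pd.Series([0.72, 0.64, 1.10])

with_apply = price_per_lb.apply(lambda p: p / 16)   # calls the function once per value
vectorized = price_per_lb / 16                      # operates on the whole column at once

print(with_apply.equals(vectorized))   # True
```

The vectorized form is usually the faster choice on large columns, while `.apply()` is handy when the logic doesn't fit a simple expression.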

# Matplotlib

In [None]:
import matplotlib.pyplot as plt

Matplotlib is a powerful and versatile tool for creating graphs. Follow along with the code below:

In [None]:
# an empty plot
plt.plot()

In [None]:
# plotting a list
plt.plot([1, 2, 3, 4])

In [None]:
# plotting two lists against each other
x = [1, 2, 3, 4]
y = [5, 19, 12, 12]
fig, ax = plt.subplots()
ax.plot(x, y)

In [None]:
# a more professional-looking plot, with saving to png!
# 1. prepare data
x = [1, 2, 3, 4]
y = [13, 53, 39, 2]

# 2. setup plot
fig, ax = plt.subplots(figsize=(20,10))

# 3. plotting data
ax.plot(x, y)

# 4. customize plot
ax.set(title="simple plot", xlabel="x axis", ylabel="y axis")

# 5. save and show figure
fig.savefig('sample-plot.png')
plt.show()

## Working with DataFrames

In [None]:
# it's very easy to use .plot() on a DataFrame.
honey_df.plot(x='year', y='stocks', kind='scatter')
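The DataFrame `.plot()` method can also draw onto an Axes you created yourself, which lets you reuse the `fig, ax` workflow from the previous section. This is a sketch on made-up data: `toy_df` and its values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

toy_df = pd.DataFrame({
    "year": [1998, 1999, 2000, 2001],
    "stocks": [120, 95, 110, 80],
})

fig, ax = plt.subplots(figsize=(8, 4))
toy_df.plot(x="year", y="stocks", kind="scatter", ax=ax)   # draw on our own Axes
ax.set(title="stocks by year", xlabel="year", ylabel="stocks")
```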

In [None]:
#TODO plot your own scatter plot on honey_df:


## Seaborn
We can use seaborn for prettier graphics.

In [None]:
import seaborn as sns

In [None]:
sns.set(rc={"figure.figsize":(20,12)}) # change the figure size to width=20, height=12
sns.lineplot(data=honey_df, x='year', y='stocks', hue='state')

Notice here that we need to pass in the data to `.lineplot()`.

What happens if we want scatter plots of every numeric column against every other?

In [None]:
sns.pairplot(data=honey_df)

In [None]:
#TODO: Change this line plot to a scatter plot. (HINT: Change sns.lineplot)
sns.set(rc={"figure.figsize":(20,12)}) # change the figure size to width=20, height=12
sns.lineplot(data=honey_df, x='year', y='stocks', hue='state')