# Notebooks are made of cells

I'll give you my philosophy of notebooks later, but notebooks are designed to be read. As such, there are two kinds of cells: Markdown cells and Python cells. Let's create a new cell below this one by hitting esc-b.

In [None]:
# it's a python cell
print("run me by hitting ctrl-enter")

Markdown cells also get run, i.e. rendered. Run me and go to the next cell by hitting shift-enter.

In [None]:
x = 5

WEIRD FACT! Notebook state is GLOBAL. Create a cell above me by hitting esc-a and define a value in it. Then create a cell below me and print it.

In [1]:
print(x)

NameError: name 'x' is not defined

It gets worse! Depending on which cells have been executed, and in what order, global state can be pretty unpredictable.
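
One way to see why execution order matters is to simulate cells as strings of code run against a shared namespace. This is only a sketch: the `cells` dictionary and the `run` helper are hypothetical names made up for illustration, and a real Jupyter kernel does considerably more.

```python
# Hypothetical stand-in for a notebook: each "cell" is a string of code,
# and all cells share one namespace, much like a Jupyter kernel.
cells = {
    "define": "x = 5",
    "double": "x = x * 2",
    "show": "result = x",
}


def run(order):
    """Execute the named cells in the given order against a fresh namespace."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["result"]


top_to_bottom = run(["define", "double", "show"])
double_run_twice = run(["define", "double", "double", "show"])
print(top_to_bottom, double_run_twice)
```

The same three cells give different answers depending on how often the middle one was run, which is why "Restart and run all" is a good habit before trusting a notebook's output.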

# Pandas and Matplotlib Tutorial

In this notebook, we will learn how to use pandas to load and manipulate some simple data, and use matplotlib to visualize the contents of a CSV file.

## Libraries

- **Pandas**: A software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Note that, by convention, this is imported as `pd`.
- **Matplotlib**: A plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. Note that we actually want to import `matplotlib.pyplot`, which by convention is imported as `plt`.


In [None]:
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Pandas

## Reading a CSV File using Pandas
CSV stands for comma-separated values. A CSV file is a way of writing down the _contents_ of a spreadsheet in a way that can be read in any language.
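
To make the format concrete, here is a minimal sketch that reads a tiny CSV from a string instead of a file. The text in `csv_text` is made up for illustration and has nothing to do with `samples.csv`.

```python
import io

import pandas as pd

# Made-up CSV text for illustration; the first line holds the column names.
csv_text = "fruit,count\napples,3\ncherries,7\n"

# read_csv accepts a file path or any file-like object, so StringIO works here.
tiny = pd.read_csv(io.StringIO(csv_text))
print(tiny)
```

Each line of text becomes one row, and each comma marks a column boundary.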


Now that we have a CSV file, we can read it using pandas. The `read_csv` function in pandas reads a CSV file into a new `DataFrame`. Sadly, all dataframes are called `df`. Always give your variables good names. Quick, give me a high-level description of the contents of `df`. Now do it again after not having looked at this notebook in six months.

Once we have our dataframe, we will call the `head`, `info` and `describe` methods to make sure the data has the right shape and looks reasonable statistically. `info` actually prints its summary, rather than returning it.

You can get the column names using the `columns` property. (There are also `tail` and `sample` functions similar to the `head` function.)

Notice that Jupyter displays the value of the last expression in a cell. `head()` is not last, so we must print it ourselves. Since `describe()` is last, its result is displayed automatically.


In [None]:
# Reading CSV file
df = pd.read_csv('samples.csv')
print("head\n", df.head())
print("\n\ninfo\n")
df.info()
print("\n\ncolumns\n", df.columns)
print("\n\ndescribe")
df.describe()
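
The difference between printing and returning can be checked directly. The sketch below uses a hypothetical toy frame rather than `samples.csv`.

```python
import pandas as pd

# Hypothetical toy data, not the contents of samples.csv.
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})

info_result = toy.info()    # prints a summary as a side effect
summary = toy.describe()    # returns a new DataFrame of statistics

print(info_result is None)
print(summary.loc["mean", "value"])
```

Because `describe` returns a regular `DataFrame`, you can index into it like any other table, whereas there is nothing to index in the result of `info`.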

## Manipulating the data

In this section we will select subsets of the data, as well as sort, group and transform it.

### Selecting subsets of the data

In this section we will select rows of the data by index and by value, as well as columns. Note that these selectors use `[]` like arrays, not `()` like functions.

In [None]:
# First we select columns by name or by a list of names
print(df["sine"].head())
df[["normal1","normal2","exp"]].head()
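
One detail worth seeing for yourself: a single name gives back a `Series`, while a list of names gives back a `DataFrame`, even when the list has only one entry. A small sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data that borrows two column names from the tutorial.
toy = pd.DataFrame({"sine": [0.0, 0.5, 1.0], "exp": [0.2, 1.3, 0.7]})

one_column = toy["sine"]      # a string selects a single column
as_frame = toy[["sine"]]      # a list of names selects a sub-table

print(type(one_column).__name__)
print(type(as_frame).__name__)
```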

In [None]:
# We can select rows by row number (index) using iloc
print(df["sine"].iloc[3])
df.iloc[5:11]
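
`iloc` counts positions, while `loc` looks up index labels. With the default index the two look alike, so the sketch below uses a hypothetical frame whose labels start at 100 to make the difference visible.

```python
import pandas as pd

# Hypothetical toy data with a non-default index, so labels and positions differ.
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_position = toy.iloc[1:3]    # positions 1 and 2; the end is excluded
by_label = toy.loc[101:103]    # labels 101 through 103; the end is included

print(len(by_position), len(by_label))
```

Note the asymmetry: position slices exclude the end like ordinary Python slices, but label slices include it.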

In [None]:
# Next we select rows by value using loc. Note that the geniuses who wrote pandas decided to use the C-style bitwise & and | operators
# to mean "and" and "or". (I assume they did that to look like R, but still.) Each comparison needs its own parentheses,
# because & binds more tightly than >= and <=. Note that the result gives me the values and their indices.
print(df.loc[df["fruit"] == "cherries"]["fruit"].count())
df.loc[(df["normal1"] >= .99) & (df["normal1"] <= 1.01)] 
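
A condition on a column is itself a column of booleans, so it can be given a name before it is used as a filter. The sketch below uses a hypothetical toy frame and also shows `Series.between`, which expresses the same range check without `&`.

```python
import pandas as pd

# Hypothetical toy data that reuses the normal1 column name.
toy = pd.DataFrame({"normal1": [0.5, 0.995, 1.0, 1.2]})

# Naming the boolean mask keeps the filter readable.
in_range = (toy["normal1"] >= 0.99) & (toy["normal1"] <= 1.01)
matches = toy.loc[in_range]

# Series.between includes both endpoints by default.
same_matches = toy.loc[toy["normal1"].between(0.99, 1.01)]

print(matches.index.tolist())
```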

### Sorting, grouping and transforming the data

In this section, we describe how to sort, group / aggregate and transform the data.

In [None]:
# sort using the sort_values function
print(df.sort_values("sine").head())
print("\n")
print(df.sort_values(["fruit", "sine"]).head())
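
Two things are easy to miss here: `sort_values` returns a new frame rather than changing the original, and each sort key can have its own direction. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data that reuses the fruit and sine column names.
toy = pd.DataFrame(
    {"fruit": ["cherries", "apples", "apples"], "sine": [0.1, 0.9, -0.3]}
)

# Fruit from A to Z, and within each fruit the largest sine first.
ordered = toy.sort_values(["fruit", "sine"], ascending=[True, False])

print(ordered["sine"].tolist())
print(toy["sine"].tolist())  # the original frame keeps its order
```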

In [None]:
# aggregating is more complicated than this example makes it appear, but you can get really far w/out running
# into anything complicated
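# numeric_only=True skips any non-numeric columns, which newer pandas versions refuse to average
grouped = df.groupby('fruit').mean(numeric_only=True)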
grouped
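
As a small step beyond a single `mean`, `agg` can compute several statistics per group in one call. The data below is hypothetical and chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data that reuses the fruit and sine column names.
toy = pd.DataFrame(
    {
        "fruit": ["apples", "cherries", "apples", "cherries"],
        "sine": [0.0, 1.0, 0.5, 0.0],
    }
)

# One row per fruit, one column per requested statistic.
stats = toy.groupby("fruit")["sine"].agg(["mean", "max", "count"])
print(stats)
```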

In [None]:
# Finally we demonstrate creating a new column from an existing column using the map function
# Note that map takes a _function_ not a function _call_
df['range'] = df['sine'].map(np.arcsin)
df.head()
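
To spell out the function-versus-call distinction, here is a sketch with a hypothetical helper, `to_percent`, on toy data. Writing `map(to_percent)` hands pandas the function itself; writing `map(to_percent())` would try to call it first, with no argument, and fail.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that reuses the sine column name.
toy = pd.DataFrame({"sine": [0.0, 0.5, 1.0]})


def to_percent(value):
    """Hypothetical helper: express a fraction as a percentage."""
    return value * 100


toy["percent"] = toy["sine"].map(to_percent)          # pass the function itself
toy["squared"] = toy["sine"].map(lambda v: v ** 2)    # a lambda works too
toy["angle"] = np.arcsin(toy["sine"])                 # NumPy functions accept a whole column
print(toy)
```

For functions that NumPy already provides, applying them to the whole column at once is usually the simpler choice.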

## Visualizing the data

In this section we will use `matplotlib` to demonstrate some simple plots and charts

### Creating a Line Plot

The first thing we are going to do is draw a simple line plot of our sine curve. Notice the `plt.show()` line at the end of the cell: it renders the plot. Without it, Jupyter would still draw the plot, but because it also displays the value of the last expression in a cell, we would get the text representation of the list returned by `plt.plot` as well, which we don't want.


In [None]:
series = np.arange(0, 100, 0.1)
plt.plot(series, np.sin(series))
plt.show()
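
The "additional information" is simply the return value of `plt.plot`. The sketch below captures it in a variable so you can inspect it; selecting the `Agg` backend is an assumption made here so that the snippet also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, so this runs outside Jupyter

import matplotlib.pyplot as plt
import numpy as np

series = np.arange(0, 10, 0.1)
returned = plt.plot(series, np.sin(series))

# plt.plot returns a list with one Line2D object per line drawn.
print(type(returned).__name__, len(returned))
plt.close("all")
```

In a notebook, ending the last line of a cell with a semicolon is another way to keep that return value from being displayed.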

### Creating a Scatterplot

Next we will create a scatter plot to visualize the relationship between two random variables. We do not expect to see much correlation between a normal and an exponential curve, but we should see a lot of correlation between two normal curves.

In [None]:
# Creating a scatterplot
plt.scatter(df['normal1'], df['normal2'])
plt.title('Scatterplot of two normal distributions')
plt.xlabel('Normal 1')
plt.ylabel('Normal 2')
plt.show()

In [None]:
# Creating a scatterplot
plt.scatter(df['normal1'], df['exp'])
plt.title('Scatterplot of a normal distribution against an exponential distribution')
plt.xlabel('Normal')
plt.ylabel('Exponential')
plt.show()
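
A scatterplot shows correlation by eye; `np.corrcoef` puts a number on it. The sketch below generates its own data, and it assumes the second normal variable is built from the first plus noise. How the columns in `samples.csv` were actually generated is not shown in this notebook, so treat this only as an illustration of the technique.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

normal_a = rng.normal(size=n)
# Assumption for illustration: the second normal variable depends on the first.
normal_b = normal_a + 0.5 * rng.normal(size=n)
# An exponential variable drawn independently of both.
exponential = rng.exponential(size=n)

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_normals = np.corrcoef(normal_a, normal_b)[0, 1]
r_mixed = np.corrcoef(normal_a, exponential)[0, 1]
print(round(r_normals, 2), round(r_mixed, 2))
```

A value near 1 or -1 means a strong linear relationship, and a value near 0 means little or none.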

### Histograms

Next we will create some simple histograms to visualize our curves and make sure they look like the distributions they were sampled from.

In [None]:
plt.hist(df['normal1'], bins=20, edgecolor="black")
plt.show()

In [None]:
plt.hist(df['exp'], bins=20, edgecolor="black")
plt.show()
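
If you want the numbers behind a histogram rather than the picture, `np.histogram` computes bin counts and bin edges without drawing anything. The samples below are generated for illustration and are not taken from `samples.csv`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical samples from a standard normal distribution.
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# 20 bins need 21 edges: one on each side of every bin.
counts, edges = np.histogram(samples, bins=20)
print(counts.sum(), len(edges))
```

Every sample lands in exactly one bin, so the counts add up to the number of samples.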

### Charts

We can also use matplotlib to create simple charts. Note that in this example I am creating subplots. 

In [None]:
_, axs = plt.subplots(1, 2, figsize=(14, 7))

fruit_counts = df["fruit"].value_counts()
# Bar chart
axs[0].bar(fruit_counts.index, fruit_counts.values, color='skyblue')
axs[0].set_title('Fruit Bar Chart')
axs[0].set_xlabel('Category')
axs[0].set_ylabel('Frequency')

# Pie chart
axs[1].pie(fruit_counts.values, labels=fruit_counts.index, autopct='%1.1f%%', startangle=90)
axs[1].set_title('Fruit Pie Chart')

plt.show()
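
Both charts are driven by `value_counts`, which tallies how often each distinct value appears and sorts the result from most to least frequent. A sketch with a hypothetical list of fruit:

```python
import pandas as pd

# Hypothetical fruit labels, not the contents of samples.csv.
fruit = pd.Series(["apples", "cherries", "apples", "pears", "apples", "cherries"])

fruit_counts = fruit.value_counts()

# The index holds the distinct values and the values hold their counts.
print(fruit_counts.index.tolist())
print(fruit_counts.values.tolist())
```

This is why the bar chart uses `fruit_counts.index` for the x positions and `fruit_counts.values` for the bar heights.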