# Python
## Week 3:  More pandas and matplotlib

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Jephian Lin</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

## 1. pandas:  DataFrame

Check the official [pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/tutorials.html) and a simple version [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html) for more information.

In [None]:
import numpy as np
import pandas as pd

There are various ways to create a `DataFrame`.

Use a two-dimensional `ndarray`.

In [None]:
mtx = np.arange(15).reshape(3,5)
mtx

In [None]:
df = pd.DataFrame(mtx)
df

Use a list of `Series` objects.  
Each `Series` becomes a row.

In [None]:
s = pd.Series(np.arange(5))
df = pd.DataFrame([s,s * 100,s > 2])
df

Use a list of dictionaries.  
Each dictionary becomes a row.  
The keys of the dictionaries become the names of the columns.

In [None]:
d1 = {'weight': 50, 'height': 150}
d2 = {'weight': 60, 'height': 160}
d3 = {'weight': 70, 'height': 170}
df = pd.DataFrame([d1,d2,d3])
df
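
As a small sketch of how the keys determine the columns, the dictionaries below (made-up values, not taken from the course data) do not all share the same keys. pandas uses the union of the keys as the columns and fills the missing entries with `NaN`.

```python
import pandas as pd

# hypothetical records; the second one has no 'height' key
records = [
    {'weight': 50, 'height': 150},
    {'weight': 60},
    {'weight': 70, 'height': 170, 'age': 30},
]
df_demo = pd.DataFrame(records)

print(df_demo.columns.tolist())  # the columns come from all keys together
print(df_demo.isna().sum())      # number of missing entries in each column
```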

Read from a `csv` file by `pd.read_csv(filename)`.

In [None]:
cat clean_data.csv ### Linux command for checking the content of a file

In [None]:
df = pd.read_csv('clean_data.csv')
df
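
If `clean_data.csv` is not at hand, `pd.read_csv` can still be tried out on text held in memory. The sketch below uses `io.StringIO` and a made-up table; the names and scores are hypothetical and are not the contents of `clean_data.csv`.

```python
import io
import pandas as pd

# hypothetical CSV text: the first line becomes the column names
csv_text = """Name,HW1,HW2,Final
Amy,10,9,88
Bob,8,7,75
"""

# read_csv accepts a file name or a file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
print(df_demo.dtypes)  # Name holds text, the scores are integers
```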

### DataFrame index and columns
If `df` is a `DataFrame`,  
`df.index` is the indices of the rows, while  
`df.columns` is the names of the columns.

In [None]:
df = pd.read_csv('clean_data.csv')
df

In [None]:
df.index

In [None]:
df.columns

You may use `.rename(dict)` to rename the indices or the columns.

In [None]:
df.rename({1:100})

In [None]:
df.rename({'Final': 'Final Exam'}, axis=1)

`df.set_index(column name)` picks a column and sets it as the index (row names).

In [None]:
df.set_index('Name')
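
Note that `rename` and `set_index` return a new `DataFrame` and leave `df` unchanged, so the result has to be assigned to a name to keep it. A minimal sketch with a made-up table (the names and scores are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Amy', 'Bob'], 'Final': [88, 75]})

# the result is returned; df_demo itself still has the old column name
renamed = df_demo.rename({'Final': 'Final Exam'}, axis=1)
print(df_demo.columns.tolist())
print(renamed.columns.tolist())

# assign the result back to keep the change
df_demo = df_demo.set_index('Name')
print(df_demo.index.tolist())
```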

### DataFrame selection
Use `.iloc[row index, column index]` or `.loc[row name, column name]` to **select** an entry.  

In [None]:
df = pd.read_csv('clean_data.csv')
df = df.set_index('Name')
df

In [None]:
print(df.iloc[0,2])
print(df.loc['Amy','HW3'])
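
`.iloc` counts positions starting from `0`, while `.loc` looks up names. The sketch below uses a small made-up table (hypothetical names and scores) to pick the same entry both ways.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'HW1': [10, 8], 'HW2': [9, 7], 'HW3': [6, 5]},
    index=['Amy', 'Bob'],
)

by_position = df_demo.iloc[0, 2]     # row 0, column 2
by_name = df_demo.loc['Amy', 'HW3']  # row 'Amy', column 'HW3'
print(by_position, by_name)
```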

Slicing allows you to select a sub-`DataFrame`.

In [None]:
df.loc[:,'HW1':'HW10']
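
One detail worth keeping in mind: slicing by name with `.loc` includes both endpoints, while slicing by position with `.iloc` excludes the stop position, as Python lists do. A minimal sketch on a made-up table:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'HW1': [10, 8], 'HW2': [9, 7], 'HW3': [6, 5], 'Final': [88, 75]},
    index=['Amy', 'Bob'],
)

by_name = df_demo.loc[:, 'HW1':'HW3']  # 'HW3' is included
by_position = df_demo.iloc[:, 0:2]     # position 2 is excluded
print(by_name.columns.tolist())
print(by_position.columns.tolist())
```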

By default,  
`df[column name]` selects a column, and  
`df.loc[row name]` selects a row.

In [None]:
df['HW7']

In [None]:
df.loc['Chris']

The same methods allow you to  
**create a new row or a new column**.

In [None]:
df.sum(axis=1) ### this computes the semester total

In [None]:
df['total'] = df.sum(axis=1) ### this creates a new column called total
df

In [None]:
np.average(df,axis=0) ### this computes the average for each component

In [None]:
df.loc['average'] = np.average(df,axis=0) ### this creates a new row called average
df

Slicing with a boolean array.

In [None]:
df[df <= 13]
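
Indexing with a boolean `DataFrame` keeps the shape of the table and replaces the entries where the condition is `False` by `NaN`. Indexing with a boolean `Series` (one value per row) keeps only the rows marked `True`. A sketch on a made-up table:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'HW1': [10, 15], 'HW2': [14, 7]},
    index=['Amy', 'Bob'],
)

masked = df_demo[df_demo <= 13]       # same shape, NaN where the test fails
rows = df_demo[df_demo['HW1'] <= 13]  # only the rows with HW1 <= 13
print(masked)
print(rows)
```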

**String method**  
`.str` contains many functions related to strings.

Make every string upper case.

In [None]:
df.index.str.upper()

Find all strings that end with something.

In [None]:
df.index.str.endswith('l')

Find all strings that start with something.

In [None]:
df.columns.str.startswith('HW')

Slicing with string methods.

In [None]:
df.loc[ df.index.str.endswith('l') ]

In [None]:
df.loc[:, df.columns.str.startswith('HW') ]

### Graphs of a DataFrame

A **line chart** is good for seeing the changes.  
`df.plot()` will plot the line chart for each column in `df`.

In [None]:
stock1 = 5 + 0.3 * np.arange(10) + np.random.randn(10)
stock2 = 10 - 0.1 * np.arange(10) + np.random.randn(10)
stock3 = 8 + 0.2 * np.arange(10) + 2 * np.random.randn(10)
df = pd.DataFrame({'stock1': stock1, 'stock2': stock2, 'stock3': stock3})
df

In [None]:
df.plot()

A **bar graph** is good for seeing the relations of different properties of an item.  
`df.plot.bar()` will produce a bar graph for each row.

In [None]:
weights = 50 + np.random.randint(-5,5,10)
heights = 2 * weights + 50 + np.random.randint(-5,5,10)
ages = np.random.randint(30,80,10)
df = pd.DataFrame({'weight': weights, 'height': heights, 'age': ages})
df

In [None]:
df.plot.bar()

A **scatter graph** is good for seeing the correlation between two properties.  
`df.plot.scatter(column name 1, column name 2)` will produce the scatter graph for these two columns.

In [None]:
df.plot.scatter('weight','height')

A **histogram** is good for seeing the distribution (the frequencies) of some data.  
`df.hist()` will draw the histogram for each column of `df`.

In [None]:
df.hist()

#### Exercise
Create a $3\times 4$ all-ones `DataFrame`.

In [None]:
### Your answer here
df = pd.DataFrame(np.ones(???))
df

#### Exercise
Try the following code to  
guess the meaning of the `cumsum()` function  
and to refresh yourself about _axis_.

In [None]:
df = pd.DataFrame(np.random.randint(5,size=(3,4)))
df

In [None]:
df.cumsum(axis=0)

In [None]:
df.cumsum(axis=1)

#### Exercise
`df` is a $3\times 4$ `DataFrame`.  
Change the names of the rows to `Day 1, ..., Day 3`  
and the names of the columns to `Price 1, ..., Price 4`.

In [None]:
df = pd.DataFrame(10 * np.random.randint(10,size=(3,4)))
### Your answer here

df

#### Exercise
Using the renamed `df` you obtained previously,  
get the entry corresponding to `Price 2` and `Day 2`.

In [None]:
### Your answer here


#### Exercise
Using the renamed `df` you obtained previously,  
get the sub-DataFrame of `df` without the first column and the first row.

In [None]:
### Your answer here


#### Exercise
Using the renamed `df` you obtained previously,  
get the row of `Day 2` in `df`. 

In [None]:
### Your answer here


#### Exercise
Using the renamed `df` you obtained previously,  
get the column of `Price 2` in `df`. 

In [None]:
### Your answer here


#### Exercise
Let `df` be the table from `clean_data.csv`  
indexed on the column of `'Name'`.  
Create a row called `max`  
that contains the maximum of each column.

In [None]:
df = pd.read_csv('clean_data.csv').set_index('Name')
### Your answer here

df

#### Exercise
Let `df` be the table from `clean_data.csv`  
indexed on the column of `'Name'`.  
Create a column called `HW total`  
that contains the total of all homework scores.

In [None]:
df = pd.read_csv('clean_data.csv').set_index('Name')
### Your answer here

df

#### Exercise
Do the previous exercise with the string method  
(if you didn't do it that way).

In [None]:
df = pd.read_csv('clean_data.csv').set_index('Name')
### Your answer here

df

#### Exercise
Let `df` be the table from `clean_data.csv`  
indexed on the column of `'Name'`.  

Apply `cumsum()` on axis 1 to see how the students accumulate scores over time.   
Then do `df.T` to get the transpose of `df`.  
Then plot this resulting DataFrame.  

Do everything in one line.  

(The answer has been provided for you,  
 and you may play around with it.)

In [None]:
df = pd.read_csv('clean_data.csv').set_index('Name')
### Your answer here

df.cumsum(axis=1).T.plot()

#### Exercise
Let `df` be the table from `clean_data.csv`  
indexed on the column of `'Name'`.  

Select the columns of exams (the two midterms and the final).  
Then draw the bar graph for each student.

Do everything in one line.

In [None]:
df = pd.read_csv('clean_data.csv').set_index('Name')
### Your answer here



#### Exercise
Let `df` be the table from `clean_data.csv`  
indexed on the column of `'Name'`.  

Draw the scatter graph by the columns of `Midterm1` and `Midterm2`.

In [None]:
df = pd.read_csv('clean_data.csv').set_index('Name')
### Your answer here



## 2. matplotlib basics

[matplotlib](https://matplotlib.org/) is a plotting library.  
Using `matplotlib`, one may draw any graph from scratch.  
Indeed, many programs, including pandas, call matplotlib to produce pictures.  

But many statistical figures involve the same routine work,  
so `matplotlib.pyplot` provides several convenient commands for drawing particular graphs  
such as the histogram, the bar graph, and so on.

See the official [matplotlib tutorial](https://matplotlib.org/tutorials/index.html) for more information.
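
For instance, `df.plot()` in pandas returns a matplotlib Axes object (see Section 3), so a picture made by pandas can be adjusted further with matplotlib. A minimal sketch with made-up numbers:

```python
import matplotlib.axes
import matplotlib.pyplot as plt
import pandas as pd

df_demo = pd.DataFrame({'stock1': [5, 6, 7], 'stock2': [10, 9, 11]})

ax = df_demo.plot()  # pandas draws with matplotlib
print(isinstance(ax, matplotlib.axes.Axes))
ax.set_title('drawn by pandas, titled by matplotlib')
```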

In [None]:
import matplotlib.pyplot as plt
import numpy as np

Most of the `matplotlib.pyplot` commands for statistics  
take **NumPy arrays** as the input.

`plt.plot(x,y)` connects the points in `zip(x,y)` by segments.

In [None]:
x = np.linspace(-3,3,10)
y = np.exp(-x**2)
plt.plot(x,y)

Make it smoother.

In [None]:
x = np.linspace(-3,3,60)
y = np.exp(-x**2)
plt.plot(x,y)

`plt.scatter(x,y)` plots the points in `zip(x,y)`  
but no segments in between.

In [None]:
plt.scatter(x,y)

`plt.hist(vals, bins=n)` separates the data in `vals` into `n` bins and draws the number of values in each bin.

In [None]:
vals = np.random.randn(1000)
plt.hist(vals,bins=50)
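
`plt.hist` also returns what it computed: the count in each bin, the bin edges, and the drawn rectangles. The sketch below uses a small made-up data set with `bins=3`, so the counts can be checked by hand.

```python
import matplotlib.pyplot as plt
import numpy as np

vals_demo = np.array([0.1, 0.2, 0.3, 1.5, 2.5, 2.9])

# 3 bins of equal width between the minimum 0.1 and the maximum 2.9
counts, edges, patches = plt.hist(vals_demo, bins=3)
print(counts)  # how many values fall in each bin
print(edges)   # 4 edges enclose 3 bins
```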

`plt.bar(x,y)` draws a bar graph that has  
`x.size` bars at positions given by `x` and with heights given by `y`.

In [None]:
x = np.arange(5)
y = 5 - x
plt.bar(x,y)

#### Exercise
There are quite a few parameters that you can adjust in `plt.plot()`.  
Run the next cell first and guess what `'go-'` means.  
Can you make it a blue, dashed line, with a pixel marker?

In [None]:
x = np.linspace(-3,3,60)
y = np.exp(-x**2)
plt.plot(x,y,'go-')

In [None]:
### Your answer here

plt.plot?

#### Exercise
For the same graph above, set the `linewidth` to `5`.

In [None]:
### Your answer here

plt.plot?

#### Exercise
`plt.plot(x1, y1, setting1, x2, y2, setting2, ...)` allows you to plot several line graphs together  
to compare their differences.  
Run the next cell first and then  
change `setting1` to draw the first line graph with a red, dotted line and the triangle_down marker, and  
change `setting2` to draw the second line graph with a green, dash-dot line and the triangle_up marker.

In [None]:
### Your answer here

plt.plot(x-1,y,x+1,y)

#### Exercise
The **legend** tells you the name of each line graph.  
You may give a name to the line graph  
by setting the `label` in `plt.plot()`.  

Since no names are given to the line graphs in the next cell,  
`plt.legend()` shows a warning.

Label the first line graph as `line1` and  
label the second line graph as `line2`.

In [None]:
plt.plot(x-1,y)
plt.plot(x+1,y)
plt.legend()

#### Exercise
For a scatter graph, you can control the size of each point.  
Assign a list to `s` to plot the scatter graph  
so that the points are getting larger from left to right.

In [None]:
x = np.linspace(-3,3,60)
y = np.exp(-x**2)
list_of_sizes = np.arange(60)

### Your answer here
plt.scatter(x, y, s=???)


#### Exercise
For a scatter graph, you can also control the color of each point.  
Assign a list to `c` and assign a colormap to `cmap`  
to plot the scatter graph  
so that each point has a different color.

A **colormap** maps a number to a color.  
Possible colormaps are `'viridis'`, `'plasma'`, `'inferno'`, `'Greys'`, `'Reds'` and so on.  
See more [here](https://matplotlib.org/gallery/color/colormap_reference.html).

In [None]:
list_of_colors = np.arange(60)

### Your answer here
plt.scatter(x, y, c=???, cmap='???')


#### Exercise
Play around with the parameters `bins` and `range` in `plt.hist()`  
to see their effects.

In [None]:
vals = np.random.randn(1000)

plt.hist(vals, bins=50, range=(0,3))

#### Exercise
The difference between a histogram and a bar graph  
is that histograms are for continuous data while  
bar graphs are for discrete categories.

Change the `width` of the second bar graph so that you can see both bar graphs.

In [None]:
numbers = np.arange(1,7)
h1 = numbers
h2 = 6 - numbers

plt.bar(numbers, h1)
plt.bar(numbers, h2)

#### Exercise
`sklearn` is a Python package that contains many tools for machine learning.  
It also contains some datasets for you to practice.  

The `iris` dataset is a famous one.  
Use `iris.keys()` to see what is contained in `iris`.  
Then read its description in `iris['DESCR']`.

In [None]:
import sklearn.datasets
iris = sklearn.datasets.load_iris()
type(iris)

In [None]:
iris.keys()

In [None]:
print(iris['DESCR'])

#### Exercise
Now you know `iris['data']` contains four features.  
Let `iris_data = iris['data']`.  

Make a scatter plot using `iris_data[:,0]` and `iris_data[:,1]`.  

These are the $0$-th and the $1$-st columns of `iris_data`,  
and they record the _sepal length_ and the _sepal width_ of each sample.

In [None]:
iris_data = iris['data']

### Your answer here


#### Exercise
You also noticed that there are three species of iris  
and `iris['target']` is an array of `0`, `1`, and `2`  
according to its species.

Make the same scatter plot as previously  
but set `c` as `iris['target']` and set `cmap` as `'viridis'`.

If now I give you the sepal width and the sepal length of an iris flower,  
can you tell me which species it is?

In [None]:
iris_data = iris['data']

### Your answer here



## 3. more matplotlib
As mentioned, `matplotlib` allows you to draw almost everything.  
Knowing more of the fundamental ideas in `matplotlib`  
gives you more freedom to adjust every detail and create various pictures.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

Use `plt.figure()` to create a **figure** object.  
A figure object is the whole canvas for a picture.

To draw something, you have to `add_axes()` first.

In [None]:
fig = plt.figure()
fig.add_axes([0,0,1,1])

`add_axes()` takes a list `[left, bottom, width, height]` as its input.  
Each parameter in the list is a **proportion** of the figure width (for `left` and `width`) or the figure height (for `bottom` and `height`).

Each coordinate system (x-axis and y-axis, usually) is called an **Axes** object.
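
To see what the proportions in `[left, bottom, width, height]` mean, the sketch below converts them to inches. The figure size of 10 by 5 inches is an assumption chosen for this example.

```python
import matplotlib.pyplot as plt

fig_demo = plt.figure(figsize=(10, 5))  # 10 inches wide, 5 inches tall
ax_demo = fig_demo.add_axes([0.1, 0.2, 0.8, 0.2])

left, bottom, width, height = ax_demo.get_position().bounds
fig_w, fig_h = fig_demo.get_size_inches()

# horizontal entries scale with the figure width,
# vertical entries scale with the figure height
print(left * fig_w, width * fig_w)     # about 1 and 8 inches
print(bottom * fig_h, height * fig_h)  # about 1 and 1 inches
```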

In [None]:
fig = plt.figure()
fig.add_axes([0,0,1,1])

ax1 = fig.add_axes([0.1,0.2,0.8,0.2]) ### If you store the Axes object
ax1.set_title('ax1') ### then you can do something to it.

ax2 = fig.add_axes([0.2,0.6,0.2,0.2])
ax2.set_title('ax2')

ax3 = fig.add_axes([0.6,0.6,0.2,0.2])
ax3.set_title('ax3')

Each picture generated previously was drawn on an **Axes** object.  
(For 2D graphs, each `Axes` object has its two **Axis** objects.)

When `ax` is an Axes object,  
you may use `ax.plot()` or so to draw on the Axes object.  

`plt.plot()` will plot on the current Axes object,  
which is usually the one created most recently.

In [None]:
fig = plt.figure()
fig.add_axes([0,0,1,1])

ax1 = fig.add_axes([0.1,0.2,0.8,0.2]) ### If you store the Axes object
ax1.set_title('ax1') ### then you can do something to it.
x = np.linspace(-5,5,60)
y = np.cos(x)
ax1.plot(x,y) 

ax2 = fig.add_axes([0.2,0.6,0.2,0.2])
ax2.set_title('ax2')
t = np.linspace(0,2*np.pi,20)
x = np.cos(t)
y = np.sin(t)
ax2.scatter(x,y)

ax3 = fig.add_axes([0.6,0.6,0.2,0.2])
ax3.set_title('ax3')
ax3.scatter(x,y)

plt.scatter([0],[0],s=[300]) ### s is the size

### `subplots()` and `subplot()`

`fig.subplots(m,n)` creates `m * n` Axes objects  
and returns them as a two-dimensional NumPy array `axs` (when `m` and `n` are both at least `2`),  
where `axs[i,j]` is the Axes on the `i`-th row and the `j`-th column.

In [None]:
fig = plt.figure()
axs = fig.subplots(3,3)
### the previous two lines can be combined as 
### fig, axs = plt.subplots(3,3)

x = [None, None, None]
x[0] = np.array([1,2,3])
x[1] = np.array([2,2,2])
x[2] = np.array([3,2,1])

for i in range(3):
    for j in range(3):
        axs[i,j].scatter(x[i],x[2-j])
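
`fig.subplots(m,n)` and `plt.subplots(m,n)` hand back the Axes objects in a NumPy array, so the usual array tools apply: `.shape` gives the grid size, and `.flat` visits every Axes in row-major order. A minimal sketch:

```python
import matplotlib.pyplot as plt
import numpy as np

fig_demo, axs_demo = plt.subplots(2, 3)
print(type(axs_demo), axs_demo.shape)

# axs_demo.flat runs through the Axes row by row
for k, ax in enumerate(axs_demo.flat):
    ax.set_title('Axes %d' % k)
```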

Alternatively, you may create **only** one subplot  
by `plt.subplot(m,n,k)`.  
This function will create an Axes object at the `k`-th position  
in the `m * n` grid.

The index of the positions is from `1` to `m * n`,  
increasing by the row-major order.

In [None]:
fig = plt.figure()
ax1 = plt.subplot(3,4,4)
ax1.set_title('4-th position in the 3 * 4 grid')

ax2 = plt.subplot(2,2,3)
ax2.set_title('3-rd position in the 2 * 2 grid')

main_ax = fig.add_axes([0,0,1,1])
main_ax.set_zorder(-1) ### zorder puts the main_ax at the back
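
The position `k` and the pair (row, column) are related by row-major counting. As a sketch, `np.unravel_index` converts `k - 1` to the zero-based row and column, and the formula `i * n + j + 1` converts back. The values of `m`, `n`, and `k` are chosen only for illustration.

```python
import numpy as np

m, n = 3, 4  # a 3 * 4 grid
k = 7        # position in plt.subplot(m, n, k), counted from 1

# zero-based row i and column j of the k-th position
i, j = np.unravel_index(k - 1, (m, n))
print(i, j)

# going back from (i, j) to k
k_again = i * n + j + 1
print(k_again)
```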

### Selecting Axes objects

`fig.axes` returns a list of all Axes objects on `fig`.  
You may use it to select the desired Axes object.

In [None]:
fig = plt.figure()
plt.subplot(2,2,2)
plt.subplot(2,2,3)
axs = fig.axes
print(axs)
axs[0].scatter([1,2],[1,2])

`matplotlib` keeps track of the **current Figure** and the **current Axes**.  
Commands like `plt.plot()` without specifying the Figure and the Axes  
will be drawn on the current Figure and the current Axes.  

Use `plt.gcf()` and `plt.gca()` to get them.

In [None]:
plt.figure()
fig = plt.gcf()
fig.patch.set_color('lightblue') 

fig.add_axes([0.2,0.2,0.6,0.6])
ax = plt.gca()
ax.bar([3,2,1],[1,2,3])
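
The current Axes is usually the one created most recently, and `plt.sca(ax)` makes another Axes current. A minimal sketch with two Axes placed side by side:

```python
import matplotlib.pyplot as plt

fig_demo = plt.figure()
ax_a = fig_demo.add_axes([0.1, 0.1, 0.35, 0.8])
ax_b = fig_demo.add_axes([0.55, 0.1, 0.35, 0.8])
print(plt.gca() is ax_b)  # the most recently added Axes is current

plt.sca(ax_a)             # make ax_a the current Axes
plt.plot([0, 1], [0, 1])  # so this line is drawn on ax_a
print(len(ax_a.lines), len(ax_b.lines))
```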

### Setting the Axes object

In [None]:
x = np.linspace(-3,3,60)
y = np.exp(-x**2)
plt.plot(x,y)

In [None]:
x = np.linspace(-3,3,60)
y = np.exp(-x**2)
plt.plot(x,y)

ax = plt.gca()
ax.set_title('A bell-shaped distribution')
ax.set_xlabel('X value')
ax.set_ylabel('Frequency')
ax.set_xlim(-10,10)
ax.set_xticks([-3,3])
ax.set_yticks([0,0.4,0.8])
ax.set_yticklabels([0,4,8])

#### Exercise
Create a figure and create at least three Axes objects on it.  
Give a title to each of the Axes objects.

In [None]:
### Your answer here 
fig = plt.figure()


#### Exercise
Copy your code in the previous exercise.  
Now make a plot in each of the Axes objects.

In [None]:
### Your answer here



#### Exercise
Create a figure and create at least three Axes objects on it  
using the `plt.subplot()` method.

What happens if your Axes objects overlap?

In [None]:
### Your answer here
fig = plt.figure()



#### Exercise
Two datasets `x[0]` and `x[1]` are given.  
Create a figure with four `subplots` and  
for the Axes object at the $i,j$ position  
make the scatter plot for `x[i]` and `x[j]`.

In [None]:
x = [0,0]
x[0] = np.random.randn(1000)
x[1] = np.random.rand(1000)

### Your answer here 



#### Exercise
Play around with the settings on each axis.

In [None]:
x = np.linspace(-3,3,60)
y = np.exp(-x**2)
plt.plot(x,y)

ax = plt.gca()
ax.set_title('A bell-shaped distribution')
ax.set_xlabel('X value')
ax.set_ylabel('Frequency')
ax.set_xlim(-10,10)
ax.set_xticks([-3,3])
ax.set_yticks([0,0.4,0.8])
ax.set_yticklabels([0,4,8])

## Homework

**Before** you do the homework,  
click on **Cell > All Output > Clear**  
to clear all previous output.  

This will decrease the size of the file.

**Remember to import the necessary packages**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#### Problem 1
The file `clean_data.csv` was obtained by cleaning up the file `scores.csv`  
and it requires a few steps.  

Recall how we input the data from `clean_data.csv`.  

If you do exactly the same thing with `scores.csv`,  
then you will run into some errors.  

Read the first few lines of `scores.csv`  
and adjust the `skiprows` parameter in `pd.read_csv()` if necessary.

In [None]:
cat clean_data.csv

In [None]:
clean_df = pd.read_csv('clean_data.csv')
clean_df

In [None]:
cat scores.csv ### sooooo ugly

In [None]:
### Your answer here
df = pd.read_csv('scores.csv') ### do something to make it work
df

#### Problem 2
Now you have successfully imported the data from `scores.csv`,  
but it is still very ugly.

First, `set_index` to the column `'Name'`.  
Second, select only those columns whose names `startswith` `'grade'`.  
(The resulting `DataFrame` should have only 13 columns,  
and it is indexed by `'Name'`.)

In [None]:
### Your answer here



#### Problem 3
Let `small_df` be the resulting `DataFrame` in Problem 2.  
Now `rename` every column from `'grade: 2018FMath555/itemname'` to `'itemname'`.

For example, `'grade: 2018FMath555/HW1'` should just be `'HW1'`,  
`'grade: 2018FMath555/Midterm1'` should just be `'Midterm1'`, and  
`'grade: 2018FMath555/Final'` should just be `'Final'`.

Now your last `DataFrame` should be the same as imported from `clean_data.csv`.

In [None]:
### Your answer here




#### Problem 4
Now a figure has been plotted for you.  

Get the current Axes object by `plt.gca()`.  
Then `set_xlabel` to `'sepal length (cm)'`  
and `set_ylabel` to `'sepal width (cm)'`.

In [None]:
import sklearn.datasets
iris = sklearn.datasets.load_iris()

iris_data = iris['data']
plt.scatter(iris_data[:,0], iris_data[:,1], c=iris['target'], cmap='viridis')

### Your answer here




#### Problem 5
In `iris['data']`, it records four features of each iris sample.  
The four features are in `iris['feature_names']`.  

Make a figure with 16 `subplots`.  
Then for the $i,j$-th subplot,  
make the same scatter plot as in the previous problem  
but with the `i`-th column and the `j`-th column.

If possible, `set_xlabel` and `set_ylabel` for them in a nice way.
(See the example at the end.)

In [None]:
import sklearn.datasets
iris = sklearn.datasets.load_iris()

iris['feature_names']

In [None]:
fig, axs = plt.subplots(4,4,figsize=[10,10])

### Your answer here




Here is an example for Problem 5.

<img src="iris_comparison.png" width="60%">