# Python Libraries Interactive Notebook

Today's workshop will focus on four fundamental python libraries: Numpy, Pandas, Matplotlib and Seaborn. Documentation and extra information about each is available on Google/Stack Overflow!
- Numpy: https://docs.scipy.org/doc/numpy-1.16.1/user/
- Pandas:  http://pandas.pydata.org/pandas-docs/stable/
- Matplotlib: https://matplotlib.org/2.1.2/index.html
- Seaborn: https://seaborn.pydata.org/

# Table of Contents

I. [Numpy](#1)<br>
II. [Pandas](#2)<br>
III. [Plot with Pandas](#2.5)<br>
IV. [Matplotlib](#3)<br>
V. [Seaborn](#4)

### Jupyter Notebook Recap

`To run a cell: select cell, press SHIFT + ENTER`

Only the last line of a cell is displayed

In [None]:
"Hidden"
"Displayed"

`...` is space for your code!

To quickly `add a new cell`: make sure the current cell is highlighted in blue (click within the cell, but outside the text box) and `press A` to add a new cell above, or `press B` to add a new cell below.

To quickly `delete a cell`: select the cell (highlighted in blue again) and `press DD`

## Import

`numpy`, `pandas`, `matplotlib`, and `seaborn` need to be `import`-ed in order to use them

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# %matplotlib inline
# sns.set()

# <font id="1" color="blue">Numpy</font>

<tr><td>
<img src="http://rickizzo.com/images/posts/2017-12-19/numpy.jpeg"/></td>

<td style="text-align:left">Numpy's main use is ```np.array```
<br><br>
Numpy arrays take less space than built-in lists and come with a wide variety of useful functions.</td></tr>

In [None]:
# make an array
a = np.array([2,0,1,9])
a

In [None]:
# make a 2-dimensional array (matrix)
matrix = np.array([ [2,0,1,9],
                    [2,0,2,0],
                    [2,0,2,1] ])
matrix

Got to love Math 54 whoo!

In [None]:
# you can multiply matrices with np.dot
np.dot(matrix, a)
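
If you are curious what `np.dot` is doing here, the sketch below rebuilds the same `a` and `matrix` and checks the first entry by hand. The names `product`, `first_entry`, and `same_with_operator` are made up for this illustration.

```python
import numpy as np

a = np.array([2, 0, 1, 9])
matrix = np.array([[2, 0, 1, 9],
                   [2, 0, 2, 0],
                   [2, 0, 2, 1]])

# Each entry of the result is one row of the matrix "dotted" with a
product = np.dot(matrix, a)

# First entry by hand: 2*2 + 0*0 + 1*1 + 9*9 = 86
first_entry = np.sum(matrix[0] * a)

# The @ operator is another way to write the same product
same_with_operator = matrix @ a

print(product)
print(first_entry)
```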

### Arithmetic with numpy!

**You can add/subtract/multiply/divide with numpy arrays, element by element!** You *cannot* do this with built-in python lists (for lists, `+` joins two lists together instead).

In [None]:
a + 1

In [None]:
a * -1

In [None]:
b = np.array([2, 0, 2, 0])
a + b

Operations between two arrays can only be done on arrays of the same length (a single number like `1` is fine), otherwise <span style="color:red">an error will occur.</span> Try running the cell below!

In [None]:
# Run me!
b + np.array([1, 9])
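
Here is a small sketch of what to expect from that cell, wrapped in `try`/`except` so it runs to the end. The names `plus_one` and `mismatch_raised` are only for this illustration.

```python
import numpy as np

b = np.array([2, 0, 2, 0])

# A single number is applied to every element, so this works
plus_one = b + 1

# Arrays of length 4 and length 2 cannot be lined up element by element
try:
    b + np.array([1, 9])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

print(plus_one)
```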

Use ```len(array)``` to find length of array.

In [None]:
len(b)

In [None]:
len(np.array([1, 2, 3, 4]))

Conditionals apply to every element of a numpy array as well.

In [None]:
a = np.array([2, 0, 1, 9, ])
a == 1

### Essential array functions

Why do we use Numpy? **Numpy provides a multitude of useful functions for arrays.** 

<font color="blue">Exercise:</font> *Research how to find the square root of a numpy array.*

In [None]:
x = np.array([2, 4, 16])

In [None]:
# Find the square root of array x
x_sqrt = ...
x_sqrt

There are a BUNCH of useful numpy functions. These are some of the most used, but you can find more by searching google / numpy documentation!

In [None]:
np.sum(x)

In [None]:
np.min(x)

In [None]:
np.max(x)

In [None]:
np.median(x)

In [None]:
np.cumsum(x)

In [None]:
np.abs(x)

What do you think ```np.cumsum``` does? 

What do you think ```np.diff``` does?

In [None]:
np.diff(x)

Two super useful functions in numpy are `np.arange` and `np.linspace`, which allow you to make arrays with equal distances between values:
* np.arange asks for [`start`], `stop`, and [`step`]
* np.linspace asks for `start`, `stop`, and `num`

In [None]:
np.arange(0, 100, 5)

In [None]:
np.linspace(0, 100, 10)
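
One difference that is easy to miss: `np.arange` stops *before* `stop`, while `np.linspace` includes `stop` by default. A quick sketch (the names `stepped` and `spaced` are just for illustration):

```python
import numpy as np

# np.arange leaves out the stop value: 0, 5, ..., 95
stepped = np.arange(0, 100, 5)

# np.linspace includes the stop value: 11 evenly spaced values from 0 to 100
spaced = np.linspace(0, 100, 11)

print(stepped[-1], len(stepped))
print(spaced[-1], len(spaced))
```

Notice that asking `np.linspace` for 11 values (not 10) is what gives steps of exactly 10.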

### Python

Using ```np.arrays``` in python is different than with built-in lists.

In [None]:
a = np.array([1, 2, 3])
b = [-1, -2, -3]
print(a)
print(b)

#### Adding values to np.array is different

In [None]:
b.append("hello")
b

In [None]:
a = np.append(a, 'hello')
a
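
The key difference, sketched with numbers only: `list.append` changes the list itself, while `np.append` leaves the array alone and hands back a new one (which is why the cell above reassigns `a`). The variable names below are made up for this example.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_list.append(4)                      # changes the list in place

numbers_array = np.array([1, 2, 3])
longer_array = np.append(numbers_array, 4)  # builds and returns a new array

print(numbers_list)
print(numbers_array)   # still three elements
print(longer_array)
```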

#### For loops work the same way

In [None]:
c = np.array([1, 2, 3, 4, 5])
cumulative_product = 1

for element in c:
    cumulative_product *= element
    
cumulative_product

### <font color="blue">Numpy Exercises</font>

<font color="blue">Exercise:</font>  Use [`np.arange`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) to create an array called `arr` containing all multiples of three between 0 and 777.

In [None]:
arr = ...
arr

<font color="blue">Exercise:</font>  Use `arr` to create an array `arr2` of every odd number in `arr`. 

Hint: you can get certain values in an array using indexing syntax combined with a condition, i.e. `arr[arr == 10]` will give you an array of all the values in the original `arr` array that are equal to 10.
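
As a rough sketch of how that indexing works, here it is on a made-up array called `scores` (not part of the exercise):

```python
import numpy as np

scores = np.array([4, 10, 7, 10, 2])

is_ten = scores == 10      # array of True/False, one per element
tens = scores[is_ten]      # keeps only the elements where the mask is True
big = scores[scores > 5]   # the condition can also be written inline

print(is_ten)
print(tens)
print(big)
```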

In [None]:
arr2 = ...
arr2

<font color="blue">Exercise:</font>  Create the same array as `arr2` using [`np.linspace`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html) and call it `arr3`.

Note: It is okay if the values are floats instead of ints. We just want the same values.

In [None]:
arr3 = ...
arr3

<font color="blue">Exercise:</font>  Print the following statistics for `arr`: minimum, third quartile (see [`np.percentile()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html)), median, mean, maximum, and standard deviation.

In [None]:
print('Minimum: '                  + str(...))
print('Third quartile: '           + str(...))
print('Median: '                   + str(...))
print('Mean: '                     + str(...))
print('Maximum: '                  + str(...))
print('Standard deviation: '       + str(...))

# <span id="2" style="color: blue">Pandas</span>

<tr><td><img width=200 src="https://c402277.ssl.cf1.rackcdn.com/photos/13100/images/featured_story/BIC_128.png?1485963152"/></td><td>

Pandas is all about tables!</td></tr>

A table is called a ['dataframe'](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) in Pandas. Consider the table `animals`:



<table border="1" class="dataframe">
  <thead><tr><td>animal</td><td>type</td></tr></thead>
<tr><td>shark</td><td>fish</td></tr>
<tr><td>hummingbird</td><td>bird</td></tr>
<tr><td>jellyfish</td><td>invertebrate</td></tr>
<tr><td>elephant</td><td>mammal</td></tr>
</table>

## Pandas Series

DataFrames consist of columns called [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series). Series act similarly to numpy arrays.

_How to make a Series:_

1.   create a numpy ```array```
2.   call ```pd.Series(array, name="...")``` &nbsp;&nbsp; <font color="gray"># name can be anything</font>

<font color="blue">Exercise:</font> Make a Series that contains the types of animals from `animals` and has the name `type`:


In [None]:
type_array = ...
type_column = ...
type_column

<font color="blue">Exercise:</font> Make another Series for the animal column:

In [None]:
animal_array = ...
animal_column = ...
animal_column

Combine your Series into a table!

`pd.concat([ series1, series2, series3, ... ], axis=1)`

Don't forget ```axis=1``` or you'll just make a giant Series. (Recent versions of pandas require writing it as `axis=1` rather than a bare `1`.)
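
To see what the axis argument changes, here is a small sketch with two made-up Series (`fruit` and `color` are not part of the exercise). The `1` is the `axis` argument, written out here as `axis=1`.

```python
import numpy as np
import pandas as pd

fruit = pd.Series(np.array(['apple', 'banana']), name='fruit')
color = pd.Series(np.array(['red', 'yellow']), name='color')

# axis=1 places the Series side by side as columns of a DataFrame
side_by_side = pd.concat([fruit, color], axis=1)

# Without it, the Series are stacked end to end into one long Series
stacked = pd.concat([fruit, color])

print(side_by_side)
print(stacked)
```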

In [None]:
animal_info = ...
animal_info

What if we were given the DataFrame and we want to extract the columns?

In [None]:
animal_info['animal'] # we get the animal_column Series back!

### Dictionaries

Also, we can manually create tables by using a [python dictionary](https://www.python-course.eu/dictionaries.php). A dictionary has the following format:

```
d = { "name of column"   :  [  list of values  ],
      "name of column 2" :  [  list of values  ],
                        ...
                        ...                       }
```

In [None]:
d = { 'animal' : ['shark', 'hummingbird', 'jellyfish', 'elephant'],
      'type' : ['fish', 'bird', 'invertebrate', 'mammal']}

In [None]:
animal_info_again = pd.DataFrame(d)
animal_info_again

### Add Columns

Add a column to `table` labeled "new column" like so:

`table['new column'] = array`

In [None]:
animal_info['average weight (lbs)'] = np.array([2000, 1, 13, 9000])
animal_info

<font color="blue">Exercise:</font> Add a column called ```rating``` that assigns your rating from 1 to 5 for each animal :) 

In [None]:
...
animal_info  # should now include a rating column

### Drop

<font color="blue">Exercise:</font> Now, use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) the `type` column.

Hint: you must include axis=1

In [None]:
animal_info_without_type = ...
animal_info_without_type

## Lottery

Time to use a real dataset!

You can read a `.csv` file into pandas using `pd.read_csv( url )`.

Create a variable called `lottery` that loads this data: `https://data.ny.gov/api/views/5xaw-6ayf/rows.csv?accessType=DOWNLOAD`



In [None]:
lottery = pd.read_csv("https://data.ny.gov/api/views/5xaw-6ayf/rows.csv?accessType=DOWNLOAD")

Let's display the table. We could just type `lottery` and run the cell, but the lottery table is a HUGE set! So, let's display just the first five rows with:

`DataFrame.head( # of rows )` - this will default to show the first 5 rows if left blank

In [None]:
lottery.head(5)

## Row, Column Selection

Follow the structure:

`table.loc[rows, columns]`

`table.loc[2:8, [ 'Name', 'Count']]`

The above code will select columns "Name" and "Count" from rows 2 **through** 8, inclusive

In [None]:
# Returns the name of our columns
lottery.columns

The [.loc[]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) function allows you to access rows and columns by their labels, and it is **inclusive** of both ends.

In [None]:
lottery.loc[2:8, ['Draw Date', "Winning Numbers"]]
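
The "inclusive" part is worth a second look, because regular python slicing leaves out the end. A minimal sketch on a made-up table with columns "Name" and "Count":

```python
import pandas as pd

table = pd.DataFrame({'Name': list('abcdefghij'), 'Count': range(10)})

# .loc keeps both ends: row labels 2, 3, 4, 5, 6, 7 and 8
subset = table.loc[2:8, ['Name', 'Count']]

# A plain python list slice stops before index 8
plain_slice = list(range(10))[2:8]

print(len(subset))       # 7
print(len(plain_slice))  # 6
```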

<font color="blue">Exercise:</font> Return a table that includes rows 1000-1005 and only includes the column "Winning Numbers".

In [None]:
...

In [None]:
# Want to select EVERY row?
# Don't put anything before and after the colon :
lottery.loc[:, ['Draw Date', 'Winning Numbers']].head(4)

### Selecting an entire Column

Remember we can extract the column in the form of a **Series** using:

`table_name['Name of column']`

In [None]:
numbers_column = lottery['Winning Numbers']
numbers_column.head(5) # we can also use .head with Series!

### Selecting rows with a Boolean Array

Lastly, we can select rows based off of True / False data. Let's go back to the simpler `animal_info` table.

In [None]:
animal_info

In [None]:
# select row only if corresponding value in *selection* is True
selection = np.array([True, False, True, False])
animal_info[selection]

## Filtering Data

So far we have selected data based off of row numbers and column headers. Let's work on filtering data more precisely.

`table[condition]`

In [None]:
condition = lottery['Multiplier'] == 2.0
lottery[condition].head(5)

The above code only selects rows that have Multiplier equal to 2.0

If you want to select rows from a data frame without making a separate `condition` variable, use this syntax:

`table[(table[column] == value)]`

In [None]:
lottery[(lottery['Multiplier'] == 2.0)].head(10)

### Apply multiple conditions!

 `table[ (condition 1)  &  (condition 2) ]`
 
 `table[ (condition 1) | (condition 2)]`
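
Here `&` means "and" and `|` means "or". Each condition needs its own parentheses. A small sketch on a made-up `pets` table (not the lottery data):

```python
import pandas as pd

pets = pd.DataFrame({'name': ['Ada', 'Bo', 'Cy', 'Di'],
                     'legs': [4, 2, 4, 0],
                     'age':  [3, 7, 9, 1]})

# & keeps rows where BOTH conditions are True
both = pets[(pets['legs'] == 4) & (pets['age'] > 5)]

# | keeps rows where AT LEAST ONE condition is True
either = pets[(pets['legs'] == 0) | (pets['age'] > 5)]

print(both)
print(either)
```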

 
<font color="blue">Class Exercise:</font> select the rows from `lottery` that have multipliers larger than two.

In [None]:
result = ...
result.head(3)

<font color="blue">Class Exercise:</font> select the rows from `lottery` that were drawn on either June 12, 2018 or February 8, 2011.

In [None]:
...

<font color="blue">Class Exercise:</font> select the rows from `lottery` that have a mega ball of 8 and a multiplier of 3.0

In [None]:
...

### Thorough explanation:

Remember that calling `lottery['Multiplier']` returns a **Series** of all of the Multipliers.

Checking if values in the series are equal to `2.0` results in an array of {True, False} values. 

Then, we select rows based off of this boolean array. Thus, we could also do:

In [None]:
multiplier = lottery['Multiplier']
equalto_two = (multiplier == 2.0)  # equalto_two is now a Series of True/False values!
lottery[equalto_two].head(5)

## Using Numpy with Pandas

How many rows does our `lottery` table have?

In [None]:
len(lottery)

Luckily, **Numpy** functions treat pandas **Series** as np.arrays.

<font color="blue">Exercise:</font> What is the largest multiplier value in `lottery`?

In [None]:
largest_multiplier = ...
largest_multiplier

<font color="blue">Exercise:</font> How many lottery numbers have a multiplier of 4.0?

Hint: How could we find the total number of lottery numbers? Now narrow that to only those with multiplier equal to 4.0.

In [None]:
lottery_multiplierfour = ...
lottery_multiplierfour

### np.unique

In [None]:
# return an array with an element for each unique value in the Series/np.array
np.unique(lottery['Multiplier'])

The [.unique() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) can also be applied to pandas Series to find each of the unique values:

In [None]:
lottery['Multiplier'].unique()

<font color="blue">Class Exercise:</font> Find the number of different unique mega ball numbers in our dataset.

In [None]:
...

## Copy vs View

Depending on how you format your code, pandas might be returning a copy of the dataframe (i.e. a whole new dataframe, but just with the same values), or a view of the dataframe (i.e. the same dataframe itself).

In [None]:
favorite_animals = animal_info.copy()
favorite_animals

Let's say I am happy with those ratings. But Marissa loves sharks! Let's make a "new" dataframe and change the ratings accordingly:

In [None]:
marissas_animals = favorite_animals
marissas_animals['rating'] = [5, 0, 0, 0]
marissas_animals

And taking a look back at my favorite animals:

In [None]:
favorite_animals

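What happened is that `marissas_animals = favorite_animals` did not create a new dataframe. Both names point to the same dataframe, so `marissas_animals` acts as a *view* on my dataframe and any change shows up in both.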

In [None]:
animal_info

`animal_info` was not affected, because `.copy()` made pandas create a brand new dataframe with identical values instead.
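
The same behavior in one self-contained sketch, using a tiny made-up dataframe. The names `original`, `alias`, and `independent` are only for illustration.

```python
import pandas as pd

original = pd.DataFrame({'animal': ['shark', 'elephant'], 'rating': [3, 4]})

alias = original                 # no new dataframe: two names, one object
independent = original.copy()    # a separate dataframe with the same values

alias['rating'] = [5, 0]         # this change is visible through `original` too

print(alias is original)
print(original['rating'].tolist())
print(independent['rating'].tolist())
```

If you want to change a dataframe without touching the one it came from, calling `.copy()` first is the safe choice.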

### SettingWithCopyWarning
 
TL;DR: Use .loc instead of square brackets to index into data when adding new columns or changing values.

Let's pretend Marissa dislikes sharks.

In [None]:
marissas_animals[marissas_animals['animal'] == 'shark']

In [None]:
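# chained indexing: the assignment lands on a temporary copy, so marissas_animals is unchanged (pandas may warn)
marissas_animals[marissas_animals['animal'] == 'shark']['rating'] = -100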
marissas_animals

In [None]:
marissas_animals['rating']

In [None]:
# also chained indexing: whether this changes marissas_animals depends on your pandas version
marissas_animals['rating'][0] = -100
marissas_animals

In [None]:
# .loc selects the row and column and assigns in one step, so the dataframe itself is changed
marissas_animals.loc[1, 'rating'] = 1738
marissas_animals
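
`.loc` also accepts a True/False condition in the row position, so "find the shark row and change its rating" can be done in a single step. A minimal sketch on a made-up `ratings` dataframe:

```python
import pandas as pd

ratings = pd.DataFrame({'animal': ['shark', 'hummingbird', 'elephant'],
                        'rating': [5, 0, 0]})

# rows: where animal is 'shark'; column: 'rating'
ratings.loc[ratings['animal'] == 'shark', 'rating'] = -100

print(ratings)
```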

## [optional] Group By

We won't have time to go through this thoroughly in lab. However, we encourage you to look into this material if you want to go further. Feel free to ask us any questions!

In the previous section we calculated the number of unique lottery numbers.

`groupby` to the rescue!

Groupby allows us to split our table into groups, each group having one similarity.

For example, if we group by "Multiplier" we create one group for each unique multiplier.

We can apply the function `sum` to each group. This sums the other numerical columns within each group, which reduces each group to a single row: the multiplier and the sums.

Excellent tutorial: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/
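
A rough sketch of the idea, using a small made-up table called `draws` whose column names mimic the lottery data (the values are invented):

```python
import pandas as pd

draws = pd.DataFrame({'Multiplier': [2.0, 3.0, 2.0, 4.0, 3.0, 2.0],
                      'Mega Ball':  [8, 1, 5, 8, 12, 3]})

# How many rows fall into each multiplier group?
draws_per_multiplier = draws.groupby('Multiplier').size()

# Apply sum to each group: one row per multiplier
mega_ball_sums = draws.groupby('Multiplier')['Mega Ball'].sum()

print(draws_per_multiplier)
print(mega_ball_sums)
```

Other functions such as `mean`, `min`, and `max` can be applied to the groups in the same way.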


# <font id="2.5" color="blue">Plot with Pandas</font>

In [None]:
# %matplotlib inline

[Pandas.plot documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)

Pandas comes with a built-in `plot` method that can be very useful! `pandas.plot` actually uses `matplotlib` behind the scenes!

`tips` contains data about a fictional restaurant.

In [None]:
tips = sns.load_dataset('tips')
tips.head()

## Line Graphs

In [None]:
total_bill = tips['total_bill']
total_bill.plot(kind="line")  #kind='line' is optional

## Bar Graphs

We can modify our data before we graph it to analyze different things.

In [None]:
day = tips['day'].value_counts()
day.plot(kind="bar")

<font color="blue">Class Exercise:</font> How could we graph the counts of female and male customers?

In [None]:
...

# <font color="blue" id="3">Matplotlib</font>


## Line Graphs
You can use [`plt.plot()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html) to create line graphs! The required arguments are a list of x-values and a list of y-values.

In [None]:
np.random.seed(18) # To ensure that the random number generation is always the same
plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
plt.show()

In [None]:
%matplotlib inline

plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
# plt.show() no longer required

## Bar Graphs and Histograms
Let's load in a built-in dataset from Seaborn and take a quick look. (We are using .dropna() to remove all null or missing values, for example's sake. In real data analysis this isn't always the best option - look more into data cleaning!)

In [None]:
titanic = sns.load_dataset('titanic').dropna()
titanic.head()

In [None]:
town_counts = titanic['embark_town'].value_counts()
town_counts

Bar graphs are plotted using the function plt.bar().

In [None]:
plt.bar(town_counts.index, town_counts)

You can also make horizontal bar graphs using plt.barh()

In [None]:
plt.barh(town_counts.index, town_counts)

Histograms can be plotted in matplotlib using [`plt.hist()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html).
This will take one required argument of the x-axis variable.

In [None]:
plt.hist(tips['total_bill'])

Adding a `;` after the function call hides the extra text output (the values the function returns), showing only the desired plot.

In [None]:
plt.hist(tips['total_bill']);

## Scatterplots
Scatterplots can be made using [`plt.scatter()`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html). It takes in two arguments: x-values and y-values.

In [None]:
plt.scatter(titanic['age'], titanic['fare']);

Adding labels to your graphs is an important step to ensure you have easily understandable visualizations. This includes a title, axis labels, and a legend if applicable.

In [None]:
plt.scatter(titanic['age'], titanic['fare'])
plt.xlabel('Passenger Age')
plt.ylabel('Fare Amount')
plt.title('Age vs Fare');

In [None]:
plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

# Points for female passengers: 'sex' == 'female'
plt.scatter(x=titanic.loc[titanic['sex'] == 'female', 'age'], 
            y=titanic.loc[titanic['sex'] == 'female', 'fare'],
            label='female', alpha=0.6)

# Points for male passengers: 'sex' == 'male'
plt.scatter(x=titanic.loc[titanic['sex'] == 'male', 'age'], 
            y=titanic.loc[titanic['sex'] == 'male', 'fare'],
            label='male', alpha=0.6)

plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare (by Sex)')
plt.legend();

## Exercises in Matplotlib
We'll do the exercises using a dataset from kaggle: [Students Performance in Exams dataset](https://www.kaggle.com/spscientist/students-performance-in-exams). This is a fictional dataset as stated at the [source](http://roycekimmons.com/tools/generated_data/exams) so we are looking at this for example's sake, with no real-world implications.

First, let's read it in and take a look:

In [None]:
exams = pd.read_csv('StudentsPerformance.csv')
exams.head()

Let's also take a look at the different ethnic groups, using the [.unique() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) for pandas Series:

(Note: these 'race/ethnicity' values don't represent any real racial or ethnic groups, and are just fictional groups labeled A, B, C, etc.)

In [None]:
exams['race/ethnicity'].unique()

<font color="blue">Exercise:</font> Create a bar graph with the parental level of education along the x-axis, and the number of students who fit in each category along the y-axis. Make sure to add labels.

Hint: To make the x-axis categories more readable, try the function plt.xticks() with the argument rotation and set it equal to the number of degrees you want to rotate the labels by.
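
A minimal sketch of that hint on made-up labels and counts (not the exams data). The `matplotlib.use("Agg")` line is an assumption so the example also runs outside a notebook; you can drop it in Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up categories and counts, only for illustration
labels = ['first long label', 'second long label', 'third long label']
counts = [12, 30, 21]

plt.figure()
plt.bar(labels, counts)
plt.xticks(rotation=45)   # tilt the category labels by 45 degrees
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Rotated tick labels');

# Read the rotation back from the tick labels to confirm it was applied
rotations = [tick.get_rotation() for tick in plt.gca().get_xticklabels()]
print(rotations)
```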

In [None]:
parental_level = ...
plt.bar(..., ...)
...
...
...
...;

<font color="blue">Exercise:</font> Create a graph showing the distribution of writing scores. Should you use a bar graph, or a histogram for this? Don't forget to add labels.

In [None]:
...
...
...
...

<font color="blue">Exercise:</font> Create a basic scatterplot of the math scores versus the reading scores. Label your axes and title!

In [None]:
plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

plt.scatter(..., ...)
plt.xlabel(...)
plt.ylabel(...)
plt.title(...);

<font color="blue">Exercise:</font> This time, create the same scatterplot, but assign a different color for each ethnic group.

In [None]:
plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

...


plt.xlabel(...)
plt.ylabel(...)
plt.title(...)
plt.legend();

How could we create this plot in a less repetitive way? How could we use our own plotting function and a for loop to do this?

In [None]:
plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

def plot_by_race(race, x, y):
    plt.scatter(x=..., x],
             y=..., y],
             label=...)

for race in exams['race/ethnicity'].unique():
    plot_by_race(race, ..., ...)

plt.xlabel('Math Score')
plt.ylabel('Reading Score')
plt.title('Math Score vs Reading Score (by Race)')
plt.legend();

# <font id="4" color="blue"> Seaborn</font>

## Histogram
Let's look at the titanic dataset again. Seaborn's version of the histogram function is sns.distplot(). By default, it shows a relative distribution and overlays a kernel density estimate. To show just a plain histogram, you can add the argument kde=False. (Note: sns.distplot() is deprecated as of seaborn 0.11; newer versions offer sns.histplot() for the same job.)

In [None]:
plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1)
sns.distplot(titanic['age'])

plt.subplot(1, 2, 2)
sns.distplot(titanic['age'], kde=False);

## Scatterplot
To create a scatterplot using seaborn, you can use sns.lmplot(). It takes x-values and y-values, and overlays a least-squares regression line with a shaded confidence interval (95% by default).

Note: You can use pandas indexing like we did with matplotlib, or you can pass the dataset into the data argument and refer to columns by their names instead.

In [None]:
sns.lmplot(x='age', y='fare', data=titanic);

Let's do that same plot from earlier, where we grouped by sex. In seaborn, we only need to pass in an additional argument of hue to color the dots by a category:

In [None]:
sns.lmplot(x='age', y='fare', hue='sex', data=titanic);

We can turn off the regression line with fit_reg=False

In [None]:
sns.lmplot(x='age', y='fare', hue='sex', data=titanic, fit_reg=False);

## Seaborn Exercises
<font color="blue">Exercise:</font> Your turn! Create a histogram of the writing scores in the exams dataset.

In [None]:
...

<font color="blue">Exercise:</font> Now create a histogram of the math scores in the exams dataset, without the kernel density estimator.

In [None]:
...

<font color="blue">Exercise:</font> Now try to create a scatterplot of writing scores versus math scores, and have the points colored based on the kind of lunch each student received. Try turning off the regression line as well.

In [None]:
...

This is the end of our workshop! Thank you for coming out.

We hope you learned a thing or two about Python libraries and how to visualize data. Keep this notebook and the slides handy for your future reference!

<img width="120" src="https://dss.berkeley.edu/static/img/logo.jpg"/>