# Day 7

* [NumPy Mathematical Operations](#numpy-math-operations)
* [Pandas](#pandas)
* [Visualizations](#visualizations)

## Numpy math operations

We can perform basic mathematical operations on a single array as well as between arrays. 


### Single Array Math

In [None]:
import numpy as np

# Creating a new array of random values with shape (3,3)


In [None]:
# Add 5 to the array and create a new array with the result
# random_arr will remain unchanged 


In [None]:
# Modifying random_arr itself by subtracting 5 from it


In [None]:
# Another way of modifying random_arr 
# The following method can be used with int, float, and string variables too
random_arr += 5  # this means random_arr = random_arr + 5

In [None]:
# Other mathematical operations can be used with numpy arrays like // and **
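
The single-array operations in this section could look roughly like the sketch below. It is only an illustration, not the lecture's solution: the seeded generator, the integer value range, and the names `original`, `plus_five`, `floor_half`, and `squared` are assumptions made here so the results are reproducible.

```python
import numpy as np

# Assumption: a seeded generator and small integers keep the output reproducible.
rng = np.random.default_rng(seed=0)
random_arr = rng.integers(0, 10, size=(3, 3))  # 3x3 array of integers in [0, 10)
original = random_arr.copy()                   # kept only for comparison

plus_five = random_arr + 5    # new array; random_arr itself is unchanged
random_arr = random_arr - 5   # rebinds the name to the result of the subtraction
random_arr += 5               # in-place addition, back to the original values

floor_half = random_arr // 2  # element-wise floor division
squared = random_arr ** 2     # element-wise power
```

Every operator is applied element by element, so each result has the same shape as `random_arr`.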


### Multiple Array Math 

In [None]:
# Adding two different columns of random_arr into added_col

# what is the resulting shape of added_col?


In [None]:
# Multiplying two columns of random_arr 
# this is element-wise multiplication
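
As a rough sketch of arithmetic between arrays, the example below adds and multiplies two columns. The fixed values (used instead of random ones so the results can be checked by hand) and the names `first_col`, `second_col`, and `multiplied_col` are assumptions for illustration.

```python
import numpy as np

# Assumption: fixed values stand in for the random array.
random_arr = np.arange(1, 10).reshape(3, 3)  # [[1 2 3], [4 5 6], [7 8 9]]

first_col = random_arr[:, 0]   # [1 4 7]
second_col = random_arr[:, 1]  # [2 5 8]

added_col = first_col + second_col       # [ 3  9 15]
multiplied_col = first_col * second_col  # [ 2 20 56], element-wise, not a dot product

print(added_col.shape)  # (3,) because selecting one column gives a 1-D array
```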


## Pandas

### Intro

As promised, here is a short primer on pandas. It is mostly a summary of the official [getting-started tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html); please refer to it for more details.

In [None]:
# import package


In [None]:
# read the data from yesterday's CSV file
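
A later cell in this notebook reads `day6-winequality-red.csv` with `delimiter=';'`. The sketch below shows the same `pd.read_csv` call on a tiny in-memory sample, so it runs without the course file. The four rows are made-up illustrative values, not taken from the real data set, and `csv_text` is a name introduced here.

```python
import io
import pandas as pd

# Assumption: a made-up sample in the same semicolon-separated layout.
csv_text = """density;pH;alcohol;quality
0.9978;3.51;9.4;5
0.9968;3.20;9.8;5
0.9980;3.16;9.8;6
0.9964;3.58;10.0;7
"""

# With the real file, pass its path instead of the StringIO object.
data = pd.read_csv(io.StringIO(csv_text), delimiter=";")
print(data.head())  # first rows as a table
```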


A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.

In [None]:
# look at data in nice pandas table format


Each column in a `DataFrame` is a `Series`.

In [None]:
# get a specific column by name using brackets and the name of the column


In [None]:
# print type of column
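
To illustrate that a single column is a `Series`, here is a minimal sketch. The small `DataFrame` holds made-up values standing in for the wine data, and the name `ph` is an assumption.

```python
import pandas as pd

# Assumption: made-up values standing in for the wine data.
data = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9980, 0.9964],
    "pH": [3.51, 3.20, 3.16, 3.58],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

ph = data["pH"]    # brackets with a column name select one column
print(type(data))  # <class 'pandas.core.frame.DataFrame'>
print(type(ph))    # <class 'pandas.core.series.Series'>
```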


It's easy to apply mathematical operations to columns of dataframes:

In [None]:
# multiply a column by some number
# the result is a new Series with the same name that holds the products


In [None]:
# get max density


In [None]:
# you can also get the maximum of every column at once with the axis parameter
# axis=0 reduces over the rows (one result per column), axis=1 reduces over the columns (one result per row)


In [None]:
# or with the apply method. This is useful for aggregating data
# you can pass in any "summary" function like max,min, etc.!
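
The column operations described in these cells might be sketched as follows. The data values are made up, the factor 1000 is arbitrary, and the lambda passed to `apply` (the range of each column) is just one hypothetical summary function.

```python
import pandas as pd

# Assumption: made-up values standing in for the wine data.
data = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9980, 0.9964],
    "pH": [3.51, 3.20, 3.16, 3.58],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

scaled = data["density"] * 1000      # new Series named "density"; data is unchanged
max_density = data["density"].max()  # a single number
col_max = data.max(axis=0)           # one maximum per column
row_max = data.max(axis=1)           # one maximum per row

# apply calls the function once per column and collects the results
col_range = data.apply(lambda col: col.max() - col.min())
```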


Print a bunch of basic statistics of a column.

In [None]:
# with the describe method.
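
A minimal sketch of `describe` on one column, using made-up alcohol values as a stand-in for the real data; `summary` is a name introduced here.

```python
import pandas as pd

# Assumption: made-up values standing in for the alcohol column.
data = pd.DataFrame({"alcohol": [9.4, 9.8, 9.8, 10.0]})

summary = data["alcohol"].describe()  # Series of count, mean, std, min, quartiles, max
print(summary)
print(summary["mean"])  # individual statistics are accessible by label
```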


## Subsets of data in pandas

How do I slice or get only certain columns? 

In [None]:
# print data again for reference


In [None]:
# subset two columns


In [None]:
# result is a new dataframe
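
Selecting several columns can be sketched like this. The data values are made up and `subset` is a name chosen for illustration.

```python
import pandas as pd

# Assumption: made-up values standing in for the wine data.
data = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9980, 0.9964],
    "pH": [3.51, 3.20, 3.16, 3.58],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

subset = data[["pH", "alcohol"]]  # a list of names inside the selection brackets
print(type(subset))               # <class 'pandas.core.frame.DataFrame'>
print(subset.shape)               # (4, 2)
```

The outer brackets select from the `DataFrame`, and the inner brackets form a Python list of column names.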


In [None]:
# let's say we only want pH values greater than 3.42


In [None]:
# alcohol summary


In [None]:
# I want only the density of wines with a low pH (< 3.42)
# using `loc`
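
Row filtering with a condition, and `loc` with a condition plus a column name, might look like the sketch below. The threshold 3.42 comes from the comments above; the data values and the names `high_ph` and `low_ph_density` are assumptions.

```python
import pandas as pd

# Assumption: made-up values standing in for the wine data.
data = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9980, 0.9964],
    "pH": [3.51, 3.20, 3.16, 3.58],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

# A comparison yields a boolean Series; indexing with it keeps the True rows.
high_ph = data[data["pH"] > 3.42]

# loc takes the row condition first and the column label second.
low_ph_density = data.loc[data["pH"] < 3.42, "density"]
```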


In [None]:
# I’m interested in rows 10 till 25 and columns 3 to 5.
# use the iloc method
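
Position-based selection with `iloc` can be sketched as follows, assuming the rows and columns in the comment are counted from 1. The 30 by 6 frame of consecutive integers and its column names are hypothetical, chosen only so the slice is easy to verify.

```python
import numpy as np
import pandas as pd

# Assumption: consecutive integers and hypothetical column names a..f.
data = pd.DataFrame(np.arange(180).reshape(30, 6), columns=list("abcdef"))

# Counting from 1, rows 10 to 25 and columns 3 to 5 correspond to these
# 0-based slices, whose end positions are excluded.
block = data.iloc[9:25, 2:5]
print(block.shape)  # (16, 3)
```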


In [None]:
# get density of wines with low pH and high alcohol content.
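
Two conditions can be combined with `&`, as in this sketch. The pH threshold 3.42 is reused from the earlier cells; the alcohol threshold 9.5, the data values, and the names `mask` and `selected_density` are assumptions for illustration.

```python
import pandas as pd

# Assumption: made-up values standing in for the wine data.
data = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9980, 0.9964, 0.9946],
    "pH": [3.51, 3.20, 3.16, 3.58, 3.39],
    "alcohol": [9.4, 9.8, 9.2, 10.0, 10.5],
})

# Each condition needs its own parentheses because & binds tighter than < and >.
mask = (data["pH"] < 3.42) & (data["alcohol"] > 9.5)
selected_density = data.loc[mask, "density"]
print(selected_density)
```

Use `|` instead of `&` to keep rows where either condition holds.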


## Visualizations

The matplotlib library is a fairly well-documented library with many examples and tutorials. It makes it easy to generate plots and save them to files.

It can be used with lists as well as numpy arrays. 

Full Documentation: https://matplotlib.org/stable/users/index.html

Official Tutorials: https://matplotlib.org/stable/tutorials/index.html

In [None]:
# Importing pyplot from the matplotlib library  


### Using the plot() function 

The coordinates of the points or line nodes are given by x and y, the first two arguments of the plot() function. An optional third argument is a format string that sets the color, marker, and line style of the graph. When no format string is given, plot() draws a solid line in the first color of the default color cycle, which is a shade of blue. The result looks much like the format string "b-", where b stands for blue and - for a solid line.

In [None]:
# Declaring some data 
x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]

# Plotting the data using the plot() function

# Need to call plt.show() 



# or use magic command %matplotlib inline at the beginning of the notebook
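
A minimal sketch of the basic `plot()` call with the data declared above. The explicit `plt.figure()` call and the name `lines` are additions made here for illustration.

```python
import matplotlib.pyplot as plt

x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]

plt.figure()                      # start a fresh figure
lines = plt.plot(x_data, y_data)  # returns a list with one Line2D object
```

In a script, follow this with `plt.show()` to open the figure window; with `%matplotlib inline` the figure is displayed when the cell finishes.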


In [None]:
# Setting the labels for axis 


# Again displaying the plot


In [None]:
plt.xlabel("Data X")
plt.ylabel("Data Y")

# Setting the title of the graph 


plt.plot(x_data, y_data)
plt.show()

In [None]:
# Saving the graph 
plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")

plt.plot(x_data, y_data)

# Call the savefig function with filename as argument after having done the plotting
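
Saving could be sketched as below. The temporary directory and the file name `simple_line_graph.png` are hypothetical choices so that the example does not write into the working directory.

```python
import os
import tempfile
import matplotlib.pyplot as plt

x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]

plt.figure()
plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")
plt.plot(x_data, y_data)

# Assumption: a temporary directory and a hypothetical file name.
out_path = os.path.join(tempfile.mkdtemp(), "simple_line_graph.png")
plt.savefig(out_path)  # the file extension selects the output format
```

In a script it is usually safest to call `savefig` before `plt.show()`, because the figure may no longer be available once its window has been closed.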


In [None]:
# Changing the formatting 
plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")

# r - red, o - circular markers in plt.plot()


In [None]:
# The first and the last data points are not fully visible 
# Let's change the axis limits to make them more visible 


plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")

plt.plot(x_data, y_data, "ro")
plt.show()
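
Combining the "ro" format string with explicit axis limits might look like this sketch. The limits 0 to 5 and 0 to 20 are the ones used in the next cell of this notebook; `points` is a name introduced here.

```python
import matplotlib.pyplot as plt

x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]

plt.figure()
plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")

plt.xlim(0, 5)   # leave room left and right of the first and last point
plt.ylim(0, 20)  # leave room below and above the lowest and highest point

points = plt.plot(x_data, y_data, "ro")  # r: red, o: circular markers, no line
```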

In [None]:
# Multiple line plots on the same graph with legend
plt.xlim(0,5)
plt.ylim(0,20) 

# create new data
y2_data = ...

# plot both data sets with different markers and set labels

# Making sure a legend is provided by using the label argument
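
One way to draw two data sets with a legend is sketched below. The values of `y2_data`, the marker styles, and the label texts are hypothetical; the notebook leaves them open.

```python
import matplotlib.pyplot as plt

x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]
y2_data = [2, 4, 6, 8]  # assumption: made-up second data set

plt.figure()
plt.xlim(0, 5)
plt.ylim(0, 20)

# label sets the text that the legend shows for each data set
plt.plot(x_data, y_data, "ro", label="y_data")
plt.plot(x_data, y2_data, "b^", label="y2_data")

legend = plt.legend()  # without this call the labels are never displayed
```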


In [None]:
# Generating a sequence sampled at intervals of 0.2 between 0.0 and 5.0
#  like range(0, 5, 0.2), except that it accepts a float step and returns a numpy array
y1 = ...

plt.plot(y1)
plt.show()
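
A sequence with a float step can be generated with `np.arange`, as in the sketch below. It assumes the intended interval runs from 0.0 to 5.0, matching the `range(0,5,0.2)` comparison in the comment; the built-in `range` itself raises a `TypeError` for a float step.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: start 0.0, stop 5.0 (excluded), step 0.2.
y1 = np.arange(0.0, 5.0, 0.2)

plt.figure()
lines = plt.plot(y1)  # with a single argument, the x values are the indices 0, 1, 2, ...
```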

### Lecture Practice (15 mins) 

Documentation for plot: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html

1. Let's do some numpy array math! We have generated a 1D array y1 in the cell above. Create two more arrays, y2 and y3. 
    1. The data in y2 should be twice the data in y1
    2. The data in y3 should be the cube of the data in y1 <br><br>

2. Create a visualization that looks like below: 

![curves-cubic](day7_three_lines.png)

HINT: Color codes - red: 'r', blue: 'b', green: 'g'

### Plot in pandas

In [None]:
# import pandas and read data again
import pandas as pd
data = pd.read_csv('day6-winequality-red.csv', delimiter=';')

In [None]:
# simple plot of alcohol


In [None]:
# scatter plot of alcohol and pH
# use parameter kind='scatter' to get a scatter plot
# alpha changes the transparency of the points and c is the color
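
A scatter plot through the pandas plotting interface might be sketched like this. The data values are made up, and the transparency 0.5, the color, and the title are arbitrary illustrative choices.

```python
import pandas as pd

# Assumption: made-up values standing in for the wine data.
data = pd.DataFrame({
    "pH": [3.51, 3.20, 3.16, 3.58, 3.39],
    "alcohol": [9.4, 9.8, 9.2, 10.0, 10.5],
})

# kind="scatter" needs both x and y column names; the call returns a matplotlib Axes
ax = data.plot(x="alcohol", y="pH", kind="scatter", alpha=0.5, c="red")
ax.set_title("pH vs alcohol")
```

Because the return value is a regular matplotlib `Axes`, any further matplotlib customization can be applied to `ax`.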


In [None]:
# box plot of alcohol
# you can use box method instead of kind='box'


In [None]:
# box plot "by" another quantity like quality
# useful for discrete data


In [None]:
# create histogram of alcohol with 20 bins
# set correct labels and add a median vertical line
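
A histogram with a vertical median line could be sketched as follows. The data values, the axis label texts, and the line style are assumptions for illustration.

```python
import pandas as pd

# Assumption: made-up values standing in for the alcohol column.
data = pd.DataFrame({"alcohol": [9.4, 9.8, 9.2, 10.0, 10.5, 9.8, 11.2]})

ax = data["alcohol"].plot.hist(bins=20)  # returns a matplotlib Axes
median_alcohol = data["alcohol"].median()

# axvline draws a vertical line across the full height of the axes
ax.axvline(median_alcohol, color="black", linestyle="--", label="median")
ax.set_xlabel("alcohol")
ax.set_ylabel("number of wines")
ax.legend()
```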


#### Multiple subplots 

In [None]:
# box plot of alcohol, density and pH
# separate axis for each using subplots argument.
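
Separate axes per column can be requested with `subplots=True`, roughly as in this sketch. The data values and the figure size are made-up illustrative choices.

```python
import pandas as pd

# Assumption: made-up values standing in for the wine data.
data = pd.DataFrame({
    "density": [0.9978, 0.9968, 0.9980, 0.9964, 0.9946],
    "pH": [3.51, 3.20, 3.16, 3.58, 3.39],
    "alcohol": [9.4, 9.8, 9.2, 10.0, 10.5],
})

# One box plot per column, each on its own axes with its own y scale
axes = data[["alcohol", "density", "pH"]].plot(kind="box", subplots=True, figsize=(9, 3))
print(len(axes))  # 3
```

Separate axes matter here because the three quantities live on very different scales, so a shared y axis would flatten the density and pH boxes.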


In [None]:
# now draw an area plot of these quantities in a single call


### Lecture Practice 

1. Create a figure containing separate boxplots for every quantity in the dataframe.

2. Create a scatter plot of density vs alcohol separated by median quality (half points in each group), using a blue/red color for the dots, some transparency, title, legends, and ensure appropriate labels.

3. Create a single figure with three histograms with quantities: alcohol, density, pH. Plot the mean and median of each histogram as a vertical line. **HINT**: What does the `plot.hist` function return? What are the methods of each object?