# Day 7

* [NumPy Mathematical Operations](#numpy-math-operations)
* [Pandas](#pandas)
* [Visualizations](#visualizations)

## NumPy math operations

We can perform basic mathematical operations on a single array as well as between arrays. 


### Single Array Math

In [None]:
import numpy as np

# Creating a new array of random values 
random_arr = np.random.rand(3,3)

random_arr

In [None]:
# The following addition will give a new array 
# random_arr will remain unchanged 
added_five = random_arr + 5

In [None]:
# Modifying random_arr itself 
random_arr = random_arr - 5 
random_arr 

In [None]:
# Another way of modifying random_arr 
# The += shorthand also works with int, float, and string variables
random_arr += 5  # same result as random_arr = random_arr + 5, but the array is updated in place

In [None]:
# Using other mathematical operators 
random_arr//2, random_arr**2 
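
The difference between `arr = arr + 5` and `arr += 5` only becomes visible when a second name refers to the same array. The following is a minimal sketch with made-up values; `arr`, `alias`, `arr2` and `alias2` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])
alias = arr            # a second name for the same array object, not a copy

arr = arr + 5          # builds a new array and rebinds the name arr to it
rebound_same_object = arr is alias    # False: alias still refers to the old data

arr2 = np.array([1.0, 2.0, 3.0])
alias2 = arr2
arr2 += 5              # updates the existing array in place
inplace_same_object = arr2 is alias2  # True: alias2 sees the change too

print(alias)    # [1. 2. 3.]
print(alias2)   # [6. 7. 8.]
```

If the original values are still needed later, the first form (or an explicit `arr.copy()`) is usually the safer choice.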

### Multiple Array Math 

In [None]:
# Adding two columns of random_arr 
added_col = random_arr[:,0] + random_arr[:,1]

added_col.shape 

In [None]:
# Multiplying added_col by the second column of random_arr
# (element-wise multiplication)
added_col * random_arr[:,1]

In [None]:
# Is random_arr changed?
random_arr
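
Because `random_arr` holds random values, the results above change on every run. Here is a small sketch of the same column operations on a fixed array, so the numbers can be checked by hand; the array `a` is a made-up example.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

col_sum = a[:, 0] + a[:, 1]    # [1+2, 3+4]
col_prod = a[:, 0] * a[:, 1]   # [1*2, 3*4], multiplied element by element

print(col_sum)    # [3 7]
print(col_prod)   # [ 2 12]
print(a)          # a itself is unchanged
```

Both operations return new 1D arrays of shape `(2,)` and leave `a` as it was.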

## Pandas

### Intro

As promised, here is a small primer on pandas. It is mostly a summary of this [tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html) from the pandas documentation; please refer to it for more details.

In [None]:
# import package
import pandas as pd

In [None]:
# read data from csv 
data = pd.read_csv('day6-winequality-red.csv', delimiter=';')

A `DataFrame` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.

In [None]:
# look at data in nice pandas table format
data
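
A `DataFrame` does not have to come from a file. As a rough illustration of columns with different types, it can also be built from a dictionary; the column names and values below are invented for this sketch and are not part of the wine dataset.

```python
import pandas as pd

# Hypothetical toy data: one text column, one integer column, one float column
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": [1, 2, 3],
    "score": [0.5, 1.5, 2.5],
})

print(toy.shape)    # (3, 3): three rows, three columns
print(toy.dtypes)   # one data type per column
```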

Each column in a `DataFrame` is a `Series`.

In [None]:
# get a specific column by name using brackets
data['fixed acidity']

In [None]:
# print type
type(data['fixed acidity'])

It's easy to apply mathematical operations to columns of dataframes:

In [None]:
# get max density
data["density"].max()

In [None]:
# calling max on the whole dataframe goes down the rows (axis=0) and returns the maximum of each column
data.max(axis=0)

In [None]:
# or with the apply method. This is useful for aggregating data
# you can pass in any "summary" function!
data.apply(max, axis=0)
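
The function passed to `apply` can also be one you write yourself. The sketch below uses a made-up helper, `value_range`, and a small invented dataframe to show the idea; neither comes from the wine data.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 4.0, 2.0], "y": [10.0, 30.0, 20.0]})

def value_range(column):
    """Return the difference between the largest and smallest value of a column."""
    return column.max() - column.min()

# With axis=0 the function is called once per column and receives a Series
ranges = toy.apply(value_range, axis=0)
print(ranges)   # x -> 3.0, y -> 20.0
```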

Print basic summary statistics of a column.

In [None]:
# with the describe method.
data['density'].describe()

## Subsets of data in pandas

How do I slice or get only certain columns? 

In [None]:
# print data again for reference
data

In [None]:
# subset two columns
density_ph = data[['density', 'pH']]
density_ph

In [None]:
# result is a new dataframe
type(density_ph)

In [None]:
# let's say we only want pH values greater than 3.42
high_ph = density_ph[density_ph['pH'] > 3.42]
high_ph

In [None]:
# alcohol summary
data['alcohol'].describe()

In [None]:
# I want only the density of wines with low pH (< 3.42)
# using the loc method
low_ph_density = data.loc[data['pH'] < 3.42, 'density']
low_ph_density

In [None]:
# I’m interested in rows 10 to 24 and columns 3 and 4 (positions count from 0).
# use the iloc method; the end of each slice is excluded
data.iloc[10:25, 3:5]
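
Slices behave differently in `iloc` and `loc`: `iloc` works on positions and excludes the end of the slice, while `loc` works on labels and includes it. A minimal sketch on an invented dataframe (`toy` is not part of the wine data):

```python
import pandas as pd

toy = pd.DataFrame({"a": range(5), "b": range(10, 15), "c": range(20, 25)})

by_position = toy.iloc[1:3, 0:2]   # rows 1 and 2, columns "a" and "b"
by_label = toy.loc[1:3, "a":"b"]   # rows 1, 2 and 3, columns "a" and "b"

print(by_position.shape)   # (2, 2)
print(by_label.shape)      # (3, 2)
```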

In [None]:
# get density of wines with low pH and high alcohol content.
cond = (data['pH'] < 3.42) & (data['alcohol'] > 10)
low_ph_high_alcohol_density = data.loc[cond, 'density']
low_ph_high_alcohol_density
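
Conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons such as `<`. The values below are invented so that the result can be checked by eye.

```python
import pandas as pd

toy = pd.DataFrame({
    "pH": [3.0, 3.5, 3.2, 3.6],
    "alcohol": [9.0, 11.0, 12.0, 9.5],
})

low_ph = toy["pH"] < 3.42          # True for rows 0 and 2
high_alcohol = toy["alcohol"] > 10  # True for rows 1 and 2

both = toy[low_ph & high_alcohol]        # row 2 only
either = toy[low_ph | high_alcohol]      # rows 0, 1 and 2
neither = toy[~(low_ph | high_alcohol)]  # row 3 only

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```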

## Visualizations

The matplotlib library is fairly well documented, with many examples and tutorials. The library makes it easy to generate plots and save them to files. 

It can be used with lists as well as numpy arrays. 

Full Documentation: https://matplotlib.org/stable/users/index.html

Official Tutorials: https://matplotlib.org/stable/tutorials/index.html

In [None]:
# Importing pyplot from the matplotlib library  
import matplotlib 
from matplotlib import pyplot as plt 

### Using the plot() function 

The coordinates of the points or line nodes are given by x and y, the first two arguments of the plot() function. An optional third argument, the format string, controls the color and the line or marker style. When no format string is given, plot() draws a solid blue line, which is roughly what the format string "b-" describes: blue color (b) and solid line (-). 

In [None]:
# Declaring some data 
x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]

# Plotting the data using the plot() function
plt.plot(x_data, y_data)

# Need to call plt.show()
plt.show() 

In [None]:
# Setting the labels for axis 
plt.xlabel("Data X")
plt.ylabel("Data Y")

# Again displaying the plot
plt.plot(x_data, y_data) 
plt.show() 

In [None]:
plt.xlabel("Data X")
plt.ylabel("Data Y")

# Setting the title of the graph 
plt.title("Simple Line Graph")

plt.plot(x_data, y_data)
plt.show()

In [None]:
# Saving the graph 
plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")

# Call the savefig function with filename as argument
plt.plot(x_data, y_data)
plt.savefig("LineGraph2.png")
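
`savefig` accepts a few optional arguments that are often useful. The sketch below assumes the file should go to the system's temporary folder; `out_path` and the file name are made up for this example. It also calls `savefig` before any `plt.show()`, since the figure is typically cleared once it has been shown.

```python
import os
import tempfile

import matplotlib.pyplot as plt

x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]

plt.figure()
plt.plot(x_data, y_data)
plt.title("Simple Line Graph")

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "line_graph_example.png")

# dpi sets the resolution, bbox_inches="tight" trims extra white space
plt.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close()  # release the figure once it has been written
```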

In [None]:
# Changing the formatting 
plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")

# r - red, o - circular markers 
plt.plot(x_data, y_data, "ro")
plt.show() 
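
A format string such as "ro" is shorthand for separate keyword arguments. As a small sketch, the two calls below request the same style, once with the short form and once spelled out; `line_short` and `line_long` are illustrative names, and the second curve is shifted up by 2 only so that both are visible.

```python
import matplotlib.pyplot as plt

x_data = [1, 2, 3, 4]
y_data = [1, 4, 9, 16]
y_shifted = [y + 2 for y in y_data]

# Short form: r = red, o = circular markers, -- = dashed line
(line_short,) = plt.plot(x_data, y_data, "ro--")

# Long form: the same style written as keyword arguments
(line_long,) = plt.plot(x_data, y_shifted, color="red", marker="o", linestyle="--")

plt.show()
```

The keyword form is longer but easier to read, and it accepts any named color rather than only the single-letter codes.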

In [None]:
# The first and the last data points are not fully visible 
# Let's change the axis limits to make them more visible 
plt.xlim(0,5)
plt.ylim(0,20)

plt.title("Simple Line Graph")
plt.xlabel("Data X")
plt.ylabel("Data Y")

plt.plot(x_data, y_data, "ro")
plt.show()

In [None]:
# Multiple line plots on the same graph 
plt.xlim(0,5)
plt.ylim(0,20) 

y_data2 = [2, 7, 9, 10]

# Making sure a legend is provided by using the label argument 
plt.plot(x_data, y_data, "ro", label="Data 1")
# plt.plot(x_data, y_data2, "bx", label="Data 2")

plt.plot(x_data, y_data2, "bx-", linewidth=0.1, label="Data 2")

plt.legend() 
plt.show() 

In [None]:
# Generating a sequence sampled at intervals of 0.2 between 0.0 and 5.0 (5.0 itself is excluded)
y1 = np.arange(0., 5., 0.2) #  like range(), but accepts float steps and returns a numpy array

plt.plot(y1)
plt.show()
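
`np.arange` fixes the step size and leaves out the stop value. When the number of points matters more than the step, `np.linspace` is a common alternative that includes the stop value by default. A short comparison, with `by_step` and `by_count` as illustrative names:

```python
import numpy as np

by_step = np.arange(0.0, 5.0, 0.2)     # step of 0.2, stop value 5.0 excluded
by_count = np.linspace(0.0, 5.0, 26)   # exactly 26 points, 5.0 included

print(len(by_step))    # 25
print(len(by_count))   # 26
print(by_count[-1])    # 5.0
```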

### Lecture Practice (15 mins) 

Documentation for plot: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html

1. Let's do some numpy array math! We have generated a 1D array y1 in the cell above. Create two more arrays, y2 and y3. 
    1. Data in y2 should be twice the data in y1
    2. Data in y3 should be the cube of the data in y1 <br><br>

2. Create a visualization that looks like the one below: 

![curves-cubic](day7_three_lines.png)

HINT: Color codes - red: 'r', blue: 'b', green: 'g'

In [None]:
y2 = y1*2
y3 = y1**3 

plt.plot(y1, label="Y1-Linear")
plt.plot(y2, "ro-", label="Y2-Double")
plt.plot(y3, "gx-", label="Y3-Cube")
plt.legend()
plt.show() 

### Plot in pandas

In [None]:
# import pandas and read data again
import pandas as pd
data = pd.read_csv('day6-winequality-red.csv', delimiter=';')

In [None]:
# simple plot of alcohol
data['alcohol'].plot()

In [None]:
# scatter plot of alcohol and pH
# alpha changes the transparency of the points
data.plot(x='alcohol', y='pH', kind='scatter', alpha=0.5, c='g')

In [None]:
# box plot of alcohol
data['alcohol'].plot.box()

In [None]:
# box plot "by" another quantity like quality
data.plot.box(column='alcohol', by='quality')

In [None]:
# create histogram of alcohol with 20 bins
# set correct labels and add a median vertical line
data['alcohol'].plot.hist(bins=20)
plt.xlabel('Alcohol')
plt.axvline(data['alcohol'].median(), c='r')

#### Multiple subplots 

In [None]:
# box plot of alcohol, density and pH
# separate axis
data[['alcohol', 'pH', 'density']].plot.box(subplots=True, figsize=(18,10))

In [None]:
# area plot of alcohol, pH and quality, each on its own axis
axs = data[['alcohol', 'pH', 'quality']].plot.area(figsize=(14, 6), subplots=True)
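
`subplots=True` draws the same kind of plot for every column. To mix different kinds of plots in one figure, one option is to create the axes with `plt.subplots` and pass each one to pandas through the `ax` argument. The sketch below uses a small invented dataframe, `toy`, instead of the wine data.

```python
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 2.5, 3.0, 4.0],
    "y": [9.0, 9.5, 10.0, 11.0, 12.5],
})

# One row, two columns of axes in a single figure
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: histogram of x with a vertical line at the median
toy["x"].plot.hist(bins=3, ax=axes[0])
axes[0].axvline(toy["x"].median(), c="r")
axes[0].set_title("x")

# Right panel: box plot of y
toy["y"].plot.box(ax=axes[1])
axes[1].set_title("y")

plt.show()
```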

### Lecture Practice 

1. Create a single figure with three histograms with quantities: alcohol, density, pH. Plot the mean, median, and mode in each histogram.

2. Create a figure containing separate boxplots for every quantity in the dataframe.

3. Create a scatter plot of density vs alcohol with the points split into two groups at the median quality (about half of the points in each group). Use blue and red dots with some transparency, and add a title, a legend, and appropriate axis labels.
