# Chapter 1 - Review of Basic Libraries
This section reviews the basic libraries used in this course (approximately 2 hours). Before starting, let's briefly review our environment (Jupyter).

<img alt="NumPy logo 2020.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/31/NumPy_logo_2020.svg/220px-NumPy_logo_2020.svg.png">
NumPy is a Python library for high-performance numerical computation.
NumPy provides the following:

<br>
<ol>
<li> High-performance storage and computation</li>
    <li> Support for multidimensional arrays and matrices</li>
    <li> A large collection of high-level mathematical operations</li>
</ol>

To install numpy in anaconda use:
`conda install -c anaconda numpy`

In [None]:
import numpy as np # np is the conventional alias
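
As a rough illustration of the high-performance point above, the sketch below computes the same sum of squares twice: once with an explicit Python loop and once with a single vectorized NumPy expression. The array size and variable names are arbitrary choices for this example, and no timing is claimed here; the point is only that both forms give the same result while the vectorized one avoids the per-element Python loop.

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

# Pure-Python loop: the interpreter handles one element at a time.
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized expression: the same arithmetic runs inside NumPy's compiled code.
vector_total = np.sum(values ** 2)

print(loop_total == vector_total)
```

In a notebook, prefixing each variant with `%timeit` is one way to compare their speed on your own machine.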

## Numpy ndarray
The building block of NumPy is a homogeneous multidimensional array called __ndarray__.

In [None]:
a = np.arange(15).reshape(3, 5) # arange returns evenly spaced values within a given interval; reshape arranges them as 3 rows x 5 columns
a

In [None]:
type(a)

#### Properties
An ndarray object has several properties. The most important ones are __shape__, __dtype__ and __size__.

In [None]:
print("a.shape", a.shape) # the dimensions of the array 
print("a.ndim", a.ndim) # the number of axes (dimensions) of the array.
print("a.size", a.size) # the total number of elements of the array.
print("a.dtype", a.dtype) # an object describing the type of the elements in the array. 
print("a.itemsize", a.itemsize) # the size in bytes of each element of the array. 
print("a.data", a.data) # the buffer containing the actual elements of the array.
# You can call a.data.tolist() to see the buffer

### Array creation
There are certain numpy functions for creating arrays. Let's review some of them.

#### Arrays from Python lists

In [None]:
# Creating from list objects
a = np.array([1,2,3])
print(a)
print(a.dtype)

In [None]:
a = np.array([1,2,3], dtype=np.float64)
print(a)
print(a.dtype)

In [None]:
a = np.array([[1,2,3], [4,5,6], [7,8,9]], dtype=np.float64)
print(a)
print("shape:", a.shape)

#### Arrays with initial values
You can initialize ndarrays of arbitrary shape with the __zeros__, __ones__, __eye__, __arange__, __linspace__, __logspace__ and __empty__ functions.

In [None]:
np.zeros(shape=(4,4), dtype=np.float64) # all zero ndarray

In [None]:
np.ones((4,4))

In [None]:
np.eye(4) # The identity matrix: given a single integer n, it returns an n x n matrix

In [None]:
np.arange(10) # Works like Python's built-in range

In [None]:
np.arange(start=5, stop=100, step=10)

In [None]:
np.linspace(start=0, stop=1, num=10) # Equally spaced num points between start and stop

In [None]:
# The end point (stop) is included by default; pass endpoint=False to exclude it.
np.linspace(start=0, stop=1, num=10, endpoint=False) 

In [None]:
np.empty((4,4)) # Returns an uninitialized array with arbitrary contents, like uninitialized memory in C

In [None]:
np.logspace(1,4,4) # 4 points from 10**1 to 10**4: unlike linspace, start and stop are exponents (base 10 by default)

#### Random Generator

In [None]:
# Uniform random
np.random.rand(5)

In [None]:
np.random.rand(5,5)

In [None]:
# Normally Distributed
np.random.randn(5)

<img src="https://img.icons8.com/color/344/light.png" width=70>  __np.random:__<br>

<br>
np.random offers many more distributions. Note that it only generates samples; for pdf and cdf functions, see `scipy.stats`. Check the documentation for more.
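
A minimal sketch of sampling from a few other distributions is shown below. It assumes NumPy's newer `Generator` interface (`np.random.default_rng`) rather than the legacy `np.random.rand` functions used above; the seed, sizes and parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed is arbitrary, used only for repeatability

uniform = rng.uniform(low=0.0, high=1.0, size=5)   # uniform on [0, 1)
normal = rng.normal(loc=0.0, scale=1.0, size=5)    # standard normal
poisson = rng.poisson(lam=3.0, size=5)             # counts with mean 3
integers = rng.integers(low=0, high=10, size=5)    # integers in [0, 10), high is exclusive

print(uniform, normal, poisson, integers, sep="\n")
```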

#### Reviewing Some Basic Operations on ndarray

In [None]:
a = np.array([[1,2], [3,4]],dtype=np.float32)
b = np.array([[-1,-2], [-3,-4]],dtype=np.float32)
c = np.arange(9).reshape((3,3))
print(a, b, c, sep="\n\n")

In [None]:
a + b # Element-wise matrix addition (subtraction works the same way with -)

In [None]:
a * b # Element wise product

In [None]:
3 * a # Scalar-Matrix multiplication

In [None]:
a/b # Element-wise division

In [None]:
np.matmul(a,b) # Matrix Multiplication

In [None]:
a**2 # Power

In [None]:
# Indexing (zero-based, row-wise)
print(a[0])
print(b[0][0])
# Two dimensional indexing
print(a[0,0])

In [None]:
print(c[0:2]) # Slicing
print(c[:,1]) # All rows, column 1
print(c[1,:]) # Row 1, all columns
print(c[0:2, 1:2]) # 2D Slicing

In [None]:
# Iteration over ndarray
for row in c:
    print(row, end="\n----------\n")

In [None]:
# Iteration over elements in flat mode
for el in c.flat:
    print(el, end=' ')

#### Shape Operations
You can change the shape of an array without changing its data. Think of the shape as one particular representation of the same stored numbers.
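
To make the "representation of stored numbers" idea concrete, here is a small sketch with hypothetical arrays `base` and `grid`. It checks whether a reshaped array shares memory with the original and shows that a write through one is visible through the other. This holds for contiguous arrays like this one; in other cases `reshape` may have to return a copy.

```python
import numpy as np

base = np.arange(6)          # six numbers stored in one buffer
grid = base.reshape(2, 3)    # a 2 x 3 view onto the same buffer

print(np.shares_memory(base, grid))

grid[0, 0] = 100             # write through the reshaped view
print(base)                  # the change shows up in the original as well
```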

In [None]:
print(c.T) # The transpose of matrix

In [None]:
print(c.ravel()) # Flatten the matrix into a 1D array

In [None]:
d = np.eye(4)
print(d, end="\n----------\n")
print(d.reshape((2,8)))

In [None]:
d.resize((2,8)) # Change the size in place
print(d)

#### Some useful numpy functions
Let's review some useful NumPy functions.

In [None]:
np.log(a)

In [None]:
np.sin(a)

In [None]:
a.sum(), a.cumsum()

In [None]:
np.sum(a), np.sum(a, axis=1)

In [None]:
a.max(), a.max(axis=0)

In [None]:
a.argmax(), a.flat[a.argmax()]

<img src="https://img.icons8.com/color/344/light.png" width=70>  __Hint:__<br>
What is reshape(-1,1)?
<br>
In reshape, you can pass -1 for one of the dimensions and NumPy infers its value from the total number of elements.

In [None]:
d = np.eye(6)
d.reshape(-1,9) # The original one has 6x6 =36 elements. so if you say reshape to (-1, 9) it will be 4x9

In [None]:
# reshape(-1,1) converts an array to a single-column 2D array (a column vector)
print(a.reshape(-1,1))

<img src= "https://img.icons8.com/external-flaticons-flat-flat-icons/344/external-question-100-most-used-icons-flaticons-flat-flat-icons.png" width=70> __Question__:<br>
Write a program that calculates the values of the function y = sin(x*x) at 100 points in the range [0, pi]. Then find the maximum and minimum values of the function and the corresponding x values. <br>
Hint: Use np.pi for the number pi. (3 lines of code - Approx. 5 mins)

In [None]:
x = np.linspace(0,np.pi,100)
f = np.sin(x*x)
f.max(), x[f.argmax()], f.min(), x[f.argmin()] # max value and its x, then min value and its x

<br><br>
<img src="https://pandas.pydata.org/docs/_static/pandas.svg" class="logo" alt="Pandas" width=300><br>
Pandas is a Python library for working with and manipulating data (mostly in tabular format).<br>
To install pandas, use the following command:<br>
`conda install -c anaconda pandas`
<br> Pandas uses NumPy under the hood.
The building blocks of pandas are Series and DataFrames. A Series stores a sequence of objects (think of a single column in a table), and a DataFrame stores tabular data. A DataFrame is a collection of Series of the same length. Pandas DataFrames are similar to Excel spreadsheets.
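
The relationship between the two building blocks can be sketched as follows. The names `ages`, `names` and `people` are hypothetical and exist only for this example: two Series of equal length are combined into a DataFrame, and selecting a column gives a Series back.

```python
import pandas as pd

# Two Series of the same length, each representing one column.
names = pd.Series(["Alice", "Bob", "Craig"], name="name")
ages = pd.Series([26, 34, 22], name="Age")

# Placing them side by side (axis=1) gives a DataFrame with two columns.
people = pd.concat([names, ages], axis=1)

print(people)
print(type(people["Age"]))       # a single column is a Series again
print(people["Age"].to_numpy())  # the values are available as a NumPy array
```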

In [None]:
import pandas as pd

### Creating Dataframes

In [None]:
# Create Dataframes
df = pd.DataFrame({"name":["Alice", "Bob", "Craig", "Sara", "John"], "Age":[26, 34, 22,50, 45], "experience":[1, 6, 2, 10, 8],
                   "Title":["IT specialist", "Security Expert", "Developer", "Manager", "Scientist"],
                   "salary":[60000, 80000, 65000, 110000, 100000]}) 
df

In [None]:
# Creating by loading from files
customer_data = pd.read_csv("Wholesale customers data.csv") # UCI dataset: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
customer_data

<img src="https://img.icons8.com/color/344/light.png" width=30 alt="Hint"> __Hint:__ Pandas has functions for reading Excel, Parquet, HTML, ORC and other formats. They all follow the pattern `pd.read_<ext>`, where `<ext>` is the format name. You can also write a dataframe to disk using methods such as `to_csv` or `to_excel`.
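
A small round-trip sketch of the read/write pairing described in the hint is given below. To keep it independent of any file on disk, it writes to an in-memory `io.StringIO` buffer instead of a path; the frame contents are made up for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [60000, 80000]})

# Write the frame as CSV text into an in-memory buffer (a file path works the same way).
buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False leaves out the row labels

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing `index=False` is a design choice here: without it, the row index is written as an extra unnamed column and comes back as data when the file is read again.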

In [None]:
# Writing data back to disk
df.to_csv("test_dataframe.csv")
customer_data.to_excel("Wholesale customers data.xlsx")  # requires openpyxl: conda install -c anaconda openpyxl


### Dataframe Properties

In [None]:
end_line = "\n" + '-'*40 + "\n"
# Shape
print("shape: ", customer_data.shape, end=end_line)
# Size
print("size", customer_data.size, end=end_line)
# Column names
print("columns: ", customer_data.columns, end=end_line)
# Data types
print("data types: ", customer_data.dtypes, end=end_line)

### Indexing and Iterating Dataframes

In [None]:
# Indexing operator in dataframe
customer_data["Frozen"]

In [None]:
print(type(customer_data["Frozen"]))

In [None]:
customer_data[0] # This raises a KeyError since there is no column called 0

In [None]:
customer_data["Channel"][0] # Indexing a column returns a Series, which can be indexed in turn

In [None]:
customer_data.at[0,"Channel"] # 2D label-based indexing: (row label, column name)

In [None]:
customer_data.iat[0, 0] # 2D position-based indexing: both row and column are integer positions

In [None]:
# A more powerful tool to index 2D is loc
customer_data.loc[0, "Channel"]

In [None]:
customer_data.loc[[1,3,5], ["Channel", "Frozen"]]

In [None]:
customer_data.iloc[0] # This returns row 0

In [None]:
customer_data.iloc[[0,1,2,3], [0,1,2]]

In [None]:
#Selecting based on condition
customer_data[customer_data["Frozen"]<40]

<img src= "https://img.icons8.com/external-flaticons-flat-flat-icons/344/external-question-100-most-used-icons-flaticons-flat-flat-icons.png" width=70> __Question__:<br>
Guess the output of the following commands:<br>
<ol><li>
    <code>customer_data[customer_data["Frozen"]<40]["Grocery"][2]</code> 
                                            </li>
    <li><code>sub_df = (customer_data[customer_data["Frozen"]<40]);sub_df[sub_df["Channel"]==1]</code></li>
</ol>                                           

In [None]:
# Iterating over dataframes
for row in df.iterrows(): # Iterating over rows as tuples of (index, series)
    print(row)

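<img src="https://img.icons8.com/color/344/light.png" width=30 alt="Hint"> __Hint:__ There are two related iteration methods: `itertuples` iterates over rows as namedtuples, and `items` (called `iteritems` before pandas 2.0) iterates over columns as (name, series) pairs.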
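
The sketch below shows the other iteration styles on a small hypothetical frame called `staff`. It uses `itertuples` for rows and `items` for columns; `items` is assumed here because `iteritems` is not available in recent pandas releases.

```python
import pandas as pd

staff = pd.DataFrame({"name": ["Alice", "Bob"], "experience": [1, 6]})

# itertuples yields one namedtuple per row; fields are read as attributes.
row_summaries = [f"{row.name}: {row.experience}" for row in staff.itertuples(index=False)]
print(row_summaries)

# items yields (column name, Series) pairs, one per column.
column_names = [column_name for column_name, column in staff.items()]
print(column_names)
```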

#### Adding, Editing and Deleting Columns

In [None]:
df["Contract"] = "Fulltime"
df

In [None]:
df["Senior"] = df["experience"] > 5
df

In [None]:
df["IC"] = df["Title"].apply(lambda x: x !="Manager")
df

In [None]:
df["SalaryPerExperience"] = df.apply(lambda row: row["salary"]/row["experience"], axis=1) # axis=1 applies the function to each row
df

In [None]:
df = df.drop(["IC"], axis=1)
df

#### Data Summarization

In [None]:
customer_data.info()

In [None]:
customer_data.value_counts()

In [None]:
customer_data.describe().T

#### Data Aggregations / Grouping

In [None]:
print(customer_data.max(), end=end_line)
print(customer_data[["Grocery", "Frozen", "Milk"]].mean(), end=end_line)
# also there are min, max, count, std, var, .... functions on groups

In [None]:
customer_data.groupby("Region").count()

In [None]:
customer_data.groupby("Region")["Region"].count().reset_index(name="count")

In [None]:
customer_data.groupby(["Region", "Channel"])["Channel"].count().reset_index(name="count")

In [None]:
customer_data.groupby("Region")[["Milk", "Frozen"]].agg(["sum", "max", "min", "mean"])

In [None]:
customer_data.groupby("Region")[["Milk", "Frozen"]].describe().T

<img src="https://img.icons8.com/color/344/light.png" width=30 alt="Hint"> __Pandasql:__ you can treat a pandas dataframe as a SQL table and run queries on it using the pandasql library. <br>
The library can be installed using the following command:<br>
`conda install -c anaconda pandasql` <br>
The library supports many SQL functions, but it does not support everything yet. Another option is to convert the dataframe to a PySpark dataframe and run Hive queries on it.

In [None]:
# pandasql example
import pandasql as ps
query = """
SELECT 
    Region, sum(Frozen) as sum_frozen, sum(Milk) as sum_milk 
FROM 
    customer_data 
GROUP BY
    Region
"""
ps.sqldf(query, locals())

<br><br>
<img alt="Matplotlib logo.svg" src="//upload.wikimedia.org/wikipedia/en/thumb/5/56/Matplotlib_logo.svg/300px-Matplotlib_logo.svg.png" decoding="async" width="300" height="55" srcset="//upload.wikimedia.org/wikipedia/en/thumb/5/56/Matplotlib_logo.svg/450px-Matplotlib_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/5/56/Matplotlib_logo.svg/600px-Matplotlib_logo.svg.png 2x" data-file-width="540" data-file-height="99"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

### Visualization Using Python
Matplotlib is the fundamental visualization library for data science work. Its plotting interface is designed to resemble MATLAB's, and it uses NumPy under the hood.
<br>For installation run the following command: `conda install -c conda-forge matplotlib`
<br> There are two important extensions for Matplotlib: __Seaborn__ and __Mplot3d__. Seaborn focuses on providing a high-level API and support for pandas, and Mplot3d focuses on 3D plotting.
<br> Seaborn: `conda install -c anaconda seaborn`
<br> Mplot3d: no separate installation is needed, since it ships with Matplotlib as `mpl_toolkits.mplot3d` (the `mpld3` package on conda-forge is a different project).

<br><br>* To use Matplotlib in Jupyter notebooks, call the following magic before importing it: <br>`%matplotlib inline`

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

In [None]:
x = np.linspace(0, 4*np.pi, 500)
y = np.sin(2*x)
plt.plot(x,y)

<table>
   <tr>
       <th><img src="https://img.icons8.com/external-flaticons-lineal-color-flat-icons/344/external-coffee-cup-bakery-flaticons-lineal-color-flat-icons.png" alt="icon" width=80></th>
    <td><img src ="https://matplotlib.org/stable/_images/anatomy.png" width=500></td>
       <td><h2>Matplotlib Figures' Anatomy</h2><h3>Did you know that you can set every element of a Matplotlib figure?</h3> <br>You can use the setter methods (for example <code>set_title</code>) to adjust every piece of it. <br> Check it later!</td>
       
  </tr>
</table>

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(x, x, label='linear')  # Plot some data on the axes.
ax.plot(x, x**2, label='quadratic')  # Plot more data on the axes...
ax.plot(x, x**3, label='cubic')  # ... and some more.
ax.set_xlabel('x label')  # Add an x-label to the axes.
ax.set_ylabel('y label')  # Add a y-label to the axes.
ax.set_title("Simple Plot")  # Add a title to the axes.
ax.legend();  # Add a legend.


#### Visualizing Random Data

In [None]:
x = np.random.randn(300,1)
y = np.random.randn(300,1)
fig, ax = plt.subplots(1,4, figsize=(20, 5))
ax[0].scatter(x,y)
ax[1].hist(x)
ax[2].boxplot(x)
ax[3].violinplot(x)
# Set title for each one
ax[0].set_title("scatter")
ax[1].set_title("hist")
ax[2].set_title("boxplot")
_=ax[3].set_title("violinplot")

#### 2D visualization of 3D graphs

<img src="https://img.icons8.com/color/344/light.png" width=30 alt="Hint"> __Hint:__ `np.meshgrid(x, y)` builds coordinate matrices from the coordinate vectors x and y, so that a function can be evaluated at every (x, y) grid point. It is used for 3D plotting.
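
A tiny sketch of what `np.meshgrid` returns, using made-up coordinate vectors `xs` and `ys` that are small enough to inspect by eye:

```python
import numpy as np

xs = np.array([0, 1, 2])   # three x coordinates
ys = np.array([10, 20])    # two y coordinates

grid_x, grid_y = np.meshgrid(xs, ys)

# Both outputs have shape (len(ys), len(xs)); each (row, col) pair is one grid point.
print(grid_x)
print(grid_y)

# Any element-wise expression is then evaluated on the whole grid at once.
heights = grid_x + grid_y
print(heights)
```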

In [None]:
# SCHWEFEL FUNCTION
x, y = np.meshgrid(np.linspace(-512, 512, 1000), np.linspace(-512, 512, 1000))
z = 418.9829*2 - x * np.sin( np.sqrt( abs( x )))-y*np.sin(np.sqrt(abs(y)))

minz = z.min()
maxz = z.max()

fig, axs = plt.subplots(2, 2, figsize=(24,15))

# pcolormesh
pc = axs[0, 0].pcolormesh(x, y, z, vmin=minz, vmax=maxz, cmap='RdBu_r', shading='auto')
fig.colorbar(pc, ax=axs[0, 0])
axs[0, 0].set_title('pcolormesh()')

# filled Contour
co = axs[0, 1].contourf(x, y, z, levels=np.linspace(minz, maxz, 8))
fig.colorbar(co, ax=axs[0, 1])
axs[0, 1].set_title('contourf()')

# As Image
pc = axs[1, 0].imshow(z, cmap='plasma', interpolation='nearest')
fig.colorbar(pc, ax=axs[1, 0], extend='both')
axs[1, 0].set_title('imshow()')

# Contour lines
co = axs[1, 1].contour(x, y, z, levels=np.linspace(minz, maxz, 8))
fig.colorbar(co, ax=axs[1, 1])
_=axs[1, 1].set_title('contour()')


#### 3D Visualizations

In [None]:
from mpl_toolkits.mplot3d import axes3d

In [None]:
fig = plt.figure(figsize=(24,15))

# Wireframe
ax = fig.add_subplot(1,2,1, projection='3d')
ax.view_init(-140, 60)
ax.plot_wireframe(x,y,z, rstride=10, cstride=10)
ax.set_title("plot_wireframe")
# Surface
ax = fig.add_subplot(1,2,2, projection='3d')
ax.view_init(-140, 60)
ax.plot_surface(x,y,z, cmap= mpl.cm.coolwarm)
ax.set_title("plot_surface")

<img src="https://img.icons8.com/color/344/light.png" width=30> __Schwefel function:__ The function used in this section is called the Schwefel function, a benchmark (test) function for optimization. There are many such functions, each designed for a specific scenario. For a comprehensive list of benchmarks, see https://en.wikipedia.org/wiki/Test_functions_for_optimization
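
As a quick sanity check of the formula used in the plots, the sketch below wraps it in a hypothetical helper named `schwefel` and evaluates it at two points. The location x = y ≈ 420.9687 is the commonly quoted minimizer of the two-dimensional Schwefel function and is taken here as an assumption, not derived.

```python
import numpy as np

def schwefel(x, y):
    """Two-dimensional Schwefel function, written as in the plotting cell above."""
    return 418.9829 * 2 - x * np.sin(np.sqrt(np.abs(x))) - y * np.sin(np.sqrt(np.abs(y)))

at_origin = schwefel(0.0, 0.0)               # both sine terms vanish here
near_minimum = schwefel(420.9687, 420.9687)  # commonly quoted minimizer (assumption)

print(at_origin, near_minimum)
```

A value close to zero at the quoted minimizer, far below the value at the origin, is what makes this function a useful test case: the global minimum sits near the edge of the domain, away from the many local minima.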

<img src="https://img.icons8.com/external-kosonicon-lineal-color-kosonicon/344/external-lab-tool-back-to-school-kosonicon-lineal-color-kosonicon.png" alt="Lab" width=80> __Three-Hump Camel Function__:<br>
The function defined by the following equation is called the three-hump camel function
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/7af38b9b0297dba9457f986bf09644df396b2af3" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -1.838ex; width:39.051ex; height:5.676ex;" alt="f(x,y)=2x^{2}-1.05x^{4}+{\frac {x^{6}}{6}}+xy+y^{2}">. Its domain is <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/5a3e7bb9757ca235b6171fd392210eeab77c12cc" class="mwe-math-fallback-image-inline" aria-hidden="true" style="vertical-align: -0.671ex; width:13.849ex; height:2.509ex;" alt="-5\leq x,y\leq 5">.

<br>Write a program that does the following:
<ol>
    <li> 3D plot (surface and wireframe) of this function on a 2x1 subplot.</li>
    <li> Contour plots (lines and filled) of this function with 10 levels between its minimum and maximum (over the calculated points)</li>
    <li> boxplot and histogram of function values</li>
</ol>
(Approx 10 min - Do it in groups of 3 to 4 people)

### Animations
Matplotlib supports different types of animations. They are useful for visualizing iterative processes. One category of animations is animating by calling a function iteratively. Let's demonstrate it using an example.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.animation
import numpy as np
from IPython.display import display, clear_output

t = np.linspace(0,2*np.pi)
x = np.sin(t)

fig, ax = plt.subplots()
l, = ax.plot([0,2*np.pi],[-1,1])

animate = lambda i: l.set_data(t[:i], x[:i])

for i in range(len(x)):
    animate(i)
    clear_output(wait=True) # this and next line are needed for working in Jupyter env.
    display(fig)
clear_output(wait=True)
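
The same idea can also be sketched with Matplotlib's `FuncAnimation` class, which calls an update function once per frame instead of relying on a manual loop. The names `update` and `anim`, the axis limits and the 50 ms interval are choices made for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

t = np.linspace(0, 2 * np.pi)
x = np.sin(t)

fig, ax = plt.subplots()
ax.set_xlim(0, 2 * np.pi)    # fix the axes so they do not rescale between frames
ax.set_ylim(-1.1, 1.1)
(line,) = ax.plot([], [])    # start with an empty line

def update(frame):
    """Show the first `frame` samples of the sine curve."""
    line.set_data(t[:frame], x[:frame])
    return (line,)

# FuncAnimation calls update(0), update(1), ... once per frame.
anim = FuncAnimation(fig, update, frames=len(t) + 1, interval=50, blit=True)
```

In a notebook, one way to display the result is `from IPython.display import HTML` followed by `HTML(anim.to_jshtml())`.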

 <img src="https://seaborn.pydata.org/_static/logo-wide-lightbg.svg" width=200><br>
 Seaborn is an extension to Matplotlib that adds more styling options as well as an API for working with pandas dataframes.
 <br> Conda installation: `conda install -c anaconda seaborn`

In [None]:
import seaborn as sns

In [None]:
y = np.random.randn(1000)
x = np.linspace(1,1000,1000)
sns.lineplot(x=x, y=y) # Nothing new, this is similar to MPL

In [None]:
# But it also has an interface for pandas. Let's look at an example
df = pd.DataFrame({"x":x, "y":y})
sns.lineplot(data=df, x="x", y="y") # columns are referenced by name

In [None]:
sns.scatterplot(data=customer_data, x="Frozen", y="Fresh", hue="Region", size="Channel").set_xlim([0., 10000])

In [None]:
sns.set_theme(style="ticks", palette="pastel")
sns.boxplot(data=customer_data, x="Region", y="Frozen", hue="Channel", showfliers=False)

In [None]:
sns.set_theme(style="darkgrid")
sns.displot(
    customer_data, x="Grocery" , col="Region", row="Channel")
#    binwidth=3, height=3, facet_kws=dict(margin_titles=True),)

In [None]:
sns.pairplot(customer_data[["Milk", "Frozen", "Fresh", "Grocery"]])

In [None]:
customer_data

<img src="https://img.icons8.com/external-kosonicon-lineal-color-kosonicon/344/external-lab-tool-back-to-school-kosonicon-lineal-color-kosonicon.png" alt="Lab" width=80> __Online Retail Data Visualization__:<br>
The file `Online_Retail.xlsx` contains online retail information (UCI data: https://archive.ics.uci.edu/ml/datasets/Online+Retail#). It has the following fields:
<ul>
<li>
    <b>InvoiceNo:</b> Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. </li>
    <li><b>StockCode:</b> Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.</li>
    <li><b>Description:</b> Product (item) name. Nominal. </li>
    <li><b>Quantity:</b> The quantities of each product (item) per transaction. Numeric.</li>
<li><b>InvoiceDate:</b> Invoice date and time. Numeric, the day and time when each transaction was generated.</li>
    <li><b>UnitPrice:</b> Unit price. Numeric, Product price per unit in sterling.</li>
    <li><b>CustomerID:</b> Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.</li>
    <li><b>Country:</b> Country name. Nominal, the name of the country where each customer resides.</li>
</ul>

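<br>__Hint:__ to convert a column of strings to dates, use `pd.to_datetime(online_retail["InvoiceDate"])`. Once the column holds datetimes, its components are available through the `.dt` accessor, for example `online_retail["InvoiceDate"].dt.year` or `online_retail["InvoiceDate"].dt.to_pydatetime()`.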
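
A minimal sketch of date handling in pandas is given below. It uses a small hypothetical frame named `orders` with made-up values instead of the real file. It assumes the dates arrive as strings; if `read_excel` already parsed them as datetimes, the conversion step is unnecessary.

```python
import pandas as pd

orders = pd.DataFrame({
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 09:01:00", "2010-12-02 10:15:00"],
    "Quantity": [6, 8, 2],
})

# Convert the string column to datetime64 values.
orders["InvoiceDate"] = pd.to_datetime(orders["InvoiceDate"])

# The .dt accessor exposes date components of a datetime column.
years = orders["InvoiceDate"].dt.year   # calendar year of each row
days = orders["InvoiceDate"].dt.date    # calendar day of each row, without the time

print(years.tolist())
print(days.tolist())
```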
<br><br>
Load the data and, based on what you have learned about seaborn and pandas, write a program that does the following:

<ol>
    <li> Explore the dataset with pandas describe and info functions</li>
    <li> Find and print the most popular product per country</li>
    <li>Plot the distribution of prices for the following countries (France, Germany, EIRE)</li>
    <li> Plot the number of transactions per day in 2010 for the United Kingdom </li>
    
</ol>