# Part 1: Introduction to Python Libraries

# 1. Libraries

A standard distribution of Python comes with a [Standard Library](https://docs.python.org/2.7/library/) that provides basic system functionality such as file I/O. It is extended by packages developed by the community, which can be found in the [Python Package Index (PyPI)](https://pypi.org/) repository. In data science, the three most fundamental packages are:
+ Numpy: Efficient manipulation of N-dimensional matrices with linear algebra functions
+ Pandas: Provides a slew of data structures and data analysis tools (Excel alternative)
+ Matplotlib: 2D plotting library for images, histograms, bar charts

## 1.1 Import Libraries

Importing these libraries under short aliases brings their functions and objects into our workspace, so we can call them with the alias prepended, for example `np.array()` or `pd.DataFrame()`.
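The import cell itself is not shown here. A minimal sketch, assuming the conventional aliases `np` and `pd` implied by the prefixes used throughout this notebook:

```python
# Conventional aliases (assumed from the np. / pd. prefixes used in this notebook)
import numpy as np
import pandas as pd

# Functions and classes are then reached through the alias
arr = np.array([1, 2, 3])
df = pd.DataFrame({"a": [1, 2, 3]})
print(type(arr), type(df))
```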

# 2. Numpy

## 2.1 Matrices and Vectors with Numpy

### Vector
$$
\vec{x} = \begin{bmatrix}1& 2& 3 \end{bmatrix}, \ \ \ \ \ 
\vec{x} = \begin{bmatrix}1 \\2 \\ 3 \end{bmatrix}
$$

### Matrix
$$
\vec{Y} = \begin{bmatrix}1 & 2 & 3 \\4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}
$$
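The vector and matrix above can be built with `np.array()` from nested lists. This is one possible sketch; the variable names `x_row`, `x_col` and `Y` are chosen here for illustration.

```python
import numpy as np

x_row = np.array([[1, 2, 3]])         # row vector, shape (1, 3)
x_col = np.array([[1], [2], [3]])     # column vector, shape (3, 1)
Y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])             # matrix, shape (3, 3)

print(x_row.shape, x_col.shape, Y.shape)
```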

### Exercise
1. Create a 3x4 matrix

2. Create a 4x3 matrix

## 2.2 Operations with Numpy objects

**Matrix Multiplication**
$\vec{Y}$: 3x3,
$\vec{x}$: 3x1
$$
\vec{Y}*\vec{x} = \begin{bmatrix}1 & 2 & 3 \\4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} * \begin{bmatrix}1 \\2 \\ 3 \end{bmatrix} =  \begin{bmatrix}14 \\32 \\ 50 \end{bmatrix}
$$

Output shape: (number of rows of $\vec{Y}$) x (number of columns of $\vec{x}$), here 3x1

In [None]:
 # np.dot()
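One way to fill in this cell, assuming `Y` and `x` are defined as in section 2.1 (they are re-created here so the sketch runs on its own):

```python
import numpy as np

Y = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # 3x3
x = np.array([[1], [2], [3]])                     # 3x1

product = np.dot(Y, x)    # matrix product; Y @ x is equivalent
print(product)
print(product.shape)
```

The result has as many rows as `Y` and as many columns as `x`.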

$$
\vec{x}*\vec{x}.T = \begin{bmatrix}1 \\2 \\ 3 \end{bmatrix} * \begin{bmatrix}1 & 2 & 3 \end{bmatrix} = \begin{bmatrix}1 & 2 & 3 \\2 & 4 & 6 \\ 3 & 6 & 9 \end{bmatrix}
$$
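The product of a column vector with its own transpose can be sketched the same way. The name `outer` is only illustrative.

```python
import numpy as np

x = np.array([[1], [2], [3]])    # column vector, shape (3, 1)

# (3, 1) times (1, 3) gives a (3, 3) matrix
outer = np.dot(x, x.T)
print(outer)
```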

**Element-wise number operators**
$$
\vec{x} + 10 = \begin{bmatrix}1 \\2 \\ 3 \end{bmatrix} + 10 = \begin{bmatrix}11 \\12 \\ 13 \end{bmatrix}
$$

In [None]:
  # element-wise number operations
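A possible version of this cell: arithmetic between an array and a single number is applied to every element.

```python
import numpy as np

x = np.array([[1], [2], [3]])

shifted = x + 10     # adds 10 to each element
scaled = x * 2       # multiplies each element by 2
print(shifted)
print(scaled)
```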

$$
(\vec{Y}) ^ 2 = \begin{bmatrix}1 & 2 & 3 \\4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}^2 = \begin{bmatrix}1 & 2 & 3 \\4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} * \begin{bmatrix}1 & 2 & 3 \\4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} = \begin{bmatrix}1 & 4 & 9 \\16 & 25 & 36 \\ 49 & 64 & 81 \end{bmatrix}
$$

In [None]:
 # **
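A short sketch of the `**` operator. In Numpy, `**` and `*` act element by element, which is different from the matrix product computed by `np.dot()`; the comparison below is added to make that contrast visible.

```python
import numpy as np

Y = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

squared = Y ** 2              # each element squared, same as Y * Y
matrix_square = np.dot(Y, Y)  # matrix product of Y with itself, for contrast
print(squared)
print(matrix_square)
```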

**Broadcasting**
$$
\vec{Y}+\vec{x} = \begin{bmatrix}1 & 2 & 3 \\4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} + \begin{bmatrix}1 \\2 \\ 3 \end{bmatrix} =   \begin{bmatrix}2 & 3 & 4 \\6 & 7 & 8 \\ 10 & 11 & 12 \end{bmatrix}
$$

In [None]:
 # broadcasting. What happened here?
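A possible answer, as a sketch: the shapes (3, 3) and (3, 1) do not match, so Numpy conceptually stretches the column vector across the three columns of `Y` before adding. The row-vector case is included as an extra illustration.

```python
import numpy as np

Y = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x_col = np.array([[1], [2], [3]])    # shape (3, 1)
x_row = np.array([1, 2, 3])          # shape (3,)

col_sum = Y + x_col   # x_col is repeated across columns: row i gets x_col[i] added
row_sum = Y + x_row   # x_row is repeated down the rows: column j gets x_row[j] added
print(col_sum)
print(row_sum)
```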

# 3. Pandas

## 3.1 Import dataset into Pandas DataFrame object

**Import Dataset**

We are using a subset of the passenger data from the ill-fated ship Titanic [(link)](https://www.kaggle.com/c/titanic/data).

In [None]:
# Read the data from the CSV file 'Titanic.csv' into a pandas DataFrame
import pandas as pd  # harmless if pandas was already imported in section 1.1

# data = pd.read_csv('data/Titanic.csv')  # alternative: read a local copy
url = 'https://raw.githubusercontent.com/prasanth-ntu/MLDA-Enthuse-V2/master/Part%201%20-%20Intro%20to%20Python%20and%20Python%20Libraries/data/Titanic.csv'
data = pd.read_csv(url)

# Inspect the data type
type(data)

**Preliminary view of data**

As a rule of thumb, *columns* store variables whereas *rows* store individual records (observations).

In [None]:
 # head and tail
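A sketch of `.head()` and `.tail()`. To keep it runnable without the download, it uses a small toy DataFrame as a stand-in for `data`; the column names and values are made up for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up columns and values)
data = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F", "G"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0],
})

first_rows = data.head()     # first 5 rows by default
last_rows = data.tail(3)     # last 3 rows
print(first_rows)
print(last_rows)
```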

## 3.2. Attributes of DataFrames
Pandas DataFrames are composed of an index, columns and the underlying numpy data
<ul>
    <li>`index`   : the label of each row</li>
    <li>`columns` : the name of each column</li>
    <li>`values`  : all the content of the DataFrame except the index and columns, as a numpy array</li>
    <li>`sort_values()`: a method (not an attribute) that sorts the data by one or more columns</li>
</ul>

In [None]:
# get column attribute  .columns
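The attributes listed above can be inspected as follows. This is a sketch on a toy DataFrame with made-up columns; with the real dataset the same attributes apply to `data`.

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up columns and values)
data = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22.0, 38.0, 26.0]})

print(data.columns)   # column names
print(data.index)     # row labels
print(data.values)    # underlying numpy array
print(data.shape)     # (number of rows, number of columns)
```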

**Sorting**

In [None]:
# Simple Sorting of data according to a column name, say 'Age', in ascending order .sort_values(by=, ascending=True)
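A minimal sketch of `.sort_values()` on a toy DataFrame. The 'Age' column name follows the comment above; the values are made up.

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up values)
data = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [38.0, 22.0, 26.0]})

# Returns a new, sorted DataFrame; the original is left unchanged
sorted_by_age = data.sort_values(by="Age", ascending=True)
print(sorted_by_age)
```

Note that the original row labels travel with the rows, so the index of the result is no longer in order.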

**Missing values**

Missing values affect our analysis of the data. For convenience, we drop all rows that contain any missing value. In practice, other techniques such as imputation (filling in estimated values) are often preferred, because dropping rows discards information.

In [None]:
 # print with NA & .dropna() & print after dropna
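A sketch of counting and dropping missing values, using a toy DataFrame with made-up columns and values:

```python
import numpy as np
import pandas as pd

# Toy stand-in with some missing entries (made-up columns and values)
data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": ["C85", "C123", None, "E46"],
})

missing_per_column = data.isna().sum()   # number of missing values in each column
cleaned = data.dropna()                  # drop every row that has at least one missing value
print(missing_per_column)
print(cleaned)
```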

## 3.3 Selection of Data

To get meaningful insights from the data, we try to **focus** only on a specific region of interest, achieved by extracting a smaller fraction from the big dataset. This is called subsetting. Selecting this subset requires us to select/specify that region via 3 entities, in order of importance: `column`, `row`, `data point`.

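### Row selection by index
Rows can be selected by their index labels, for example with a slice. A more meaningful row selection method is *Boolean masking*: a condition on a column produces a True/False value for each row, and only the rows marked True are kept.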

In [None]:
 # indexing
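Both approaches, sketched on a toy DataFrame with made-up values. The threshold of 30 is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up values)
data = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Age": [22.0, 38.0, 26.0, 35.0]})

by_label = data.loc[1:2]       # label-based slice; both end points are included

mask = data["Age"] > 30        # Boolean Series, one True/False per row
over_30 = data[mask]           # keeps only the rows where the mask is True
print(by_label)
print(over_30)
```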

### Data point selection by position index

`.iloc[]` is the primary indexer for selection by absolute (integer) position along the index and columns.
Here the 'i' in 'iloc' stands for 'integer': positions are counted from 0, regardless of the index labels.

In [None]:
 # position index
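A sketch of positional selection on a toy DataFrame with made-up values: a single data point, a rectangular block, and the first five rows.

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up values)
data = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F", "G"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0],
})

point = data.iloc[0, 1]          # row position 0, column position 1
block = data.iloc[0:2, 0:2]      # first two rows, first two columns (end excluded)
first_five = data.iloc[0:5, :]   # rows at positions 0 to 4, all columns
print(point)
print(block)
print(first_five)
```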

Compare this against the output of `data.head()`. Do you see any resemblance? What happened here?

### Column selection

In [None]:
 # select by variable
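Column selection by name, sketched on a toy DataFrame with made-up columns. A single name returns a Series, a list of names returns a DataFrame.

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up columns and values)
data = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

ages = data["Age"]               # one column -> pandas Series
subset = data[["Name", "Age"]]   # list of columns -> pandas DataFrame
print(type(ages), type(subset))
```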

### Exercise
Use `.iloc[]` to replace `.head()` and `.tail()`

In [None]:
 # .head()

In [None]:
 # .tail()

# 4. Matplotlib

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
 # randn
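The next cell plots a variable `s1` that is not defined anywhere else. A minimal sketch, assuming `s1` is meant to be an array of standard-normal random numbers; the length of 100 is an arbitrary choice.

```python
import numpy as np

# Assumption: s1 holds 100 draws from the standard normal distribution
s1 = np.random.randn(100)
print(s1[:5])
```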

In [None]:
plt.plot(s1)

## 4.1. Importance of visualizations
Suppose we generate 1000 random values to represent the daily net profit of a company over a period of 1000 days.

In [None]:
# Create a Series
s = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2015', periods=1000))

What difference does it make if we create a graphical representation of it?

In [None]:
 # cumulative sum .cumsum()
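One way to fill in this cell, as a sketch: the running total of the daily values is much easier to read as a line plot than as 1000 raw numbers. The seed and the axis labels are additions for reproducibility and readability. In a notebook with `%matplotlib inline` the figure is displayed automatically.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)  # assumption: fixed seed so the plot is reproducible
s = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2015', periods=1000))

cumulative = s.cumsum()     # running total of the daily values
ax = cumulative.plot()      # line plot with the dates on the x-axis
ax.set_xlabel("Date")
ax.set_ylabel("Cumulative profit")
```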

## 4.2. Continuous vs Discrete

+ Continuous: Age, Length, Area (sqft), pH
+ Discrete (categorical): LastName, Gender, Country, Postcode

**Exercise:** Imagine you are an officer in the Singapore Department of Statistics and you are compiling a report about the demographics of Singapore. Based on the continuous and discrete variables mentioned in the examples above, what kinds of questions come to mind?

## 4.3. 1D plots

In [None]:
# Create a plot figure (empty canvas)

# Construct histogram from continuous data

# <-- try changing these .hist(bins=20, color = 'orange')

# Labeling the figure
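A possible version of this cell. Since the column to plot is not specified, the sketch uses synthetic ages drawn from a normal distribution; the distribution parameters, labels and title are assumptions. With the real dataset, a continuous column of `data` would take the place of `ages`.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Synthetic continuous data (assumption; not taken from the Titanic file)
ages = np.random.normal(loc=30, scale=12, size=500)

# Create a plot figure (empty canvas)
fig = plt.figure()

# Construct the histogram; try changing bins and color
counts, bin_edges, patches = plt.hist(ages, bins=20, color='orange')

# Label the figure
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Histogram of age (synthetic data)")
```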

What inference can you draw from this histogram?

## 4.4. 2D plots

**Scatter plot**

Plot values from two variables along two axes to identify possible relationships between them

In [None]:
 # x .linspace & y = 2x + randn

In [None]:
 # empty figure
 # plt.scatter
 # labels
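The two cells above can be sketched as follows. The range of `x`, the number of points and the seed are arbitrary choices; the relationship y = 2x + noise follows the comment in the first cell.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.linspace(0, 10, 100)         # 100 evenly spaced values from 0 to 10
y = 2 * x + np.random.randn(100)    # linear trend plus standard-normal noise

fig = plt.figure()                  # empty figure
plt.scatter(x, y)                   # one point per (x, y) pair
plt.xlabel("x")
plt.ylabel("y")
plt.title("y = 2x + noise")
```

The points cluster around a straight line, which suggests a roughly linear relationship between the two variables.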