## Lesson 4 Overview

1. NumPy
2. Pandas

## Let's load today's lesson!

### Open Azure Notebooks library 

Go to https://notebooks.azure.com -> Sign in if needed -> Select **python-intro**

### Update lesson file to latest version

Select **New** -> **From URL** -> input https://raw.githubusercontent.com/kelvnt/python-intro/master/lessons/Lesson4.ipynb (URL is available in **Lesson4.ipynb**) -> Click outside input then select **Upload** (overwrite if needed)

### Open Jupyter lab

From your browser's bookmark or **Run** -> Change browser URL path from **/nb/tree** to **/nb/lab**

Select **Lesson4.ipynb**

### NumPy

NumPy, which stands for Numerical Python, is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. One can perform mathematical and logical operations on arrays more efficiently and effortlessly. 

**How to import the NumPy package into your Python program:**
```python
import numpy as np
```
#### Numpy Arrays:

A NumPy array is a data type provided by the NumPy library (it is not built into Python) that, like a list, holds a sequence of values. NumPy arrays can be created from Python list objects.

```python
np.array(listObject)
```

In [None]:
import numpy as np
py_list = [1, 3, 10, 50]
np_list = np.array(py_list)
print(np_list)
print(py_list)

NumPy expects all values in an array to share a single type, such as booleans or integers. If you create a NumPy array from values of different types, NumPy converts them all to one common type, such as the string type in the example below.

In [None]:
import numpy as np
py_list = [1, 'alice', True] 
np_list = np.array(py_list)  # Numpy arrays contain only one type
print(np_list)               # outputs: ['1' 'alice' 'True']
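
To see which single type NumPy settled on, you can inspect the array's `dtype`. The sketch below is illustrative only; the variable names `mixed_numbers`, `mixed_text` and `as_floats` are made up for this example.

```python
import numpy as np

# Mixing ints and floats: everything is upcast to float
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)       # float64

# Mixing in a string: every element becomes a string
mixed_text = np.array([1, 'alice', True])
print(mixed_text.dtype.kind)     # 'U' stands for a Unicode string type

# Passing dtype explicitly makes the intended type obvious
as_floats = np.array([1, 2, 3], dtype=float)
print(as_floats)
```

Checking `dtype` right after creating an array is a quick way to catch an accidental conversion to strings.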

**What is so special about Numpy arrays ?**

Suppose you have the heights and weights of your family members (as below) and are asked to calculate the BMI of each. With plain lists, you have to loop over the elements one by one, which is slower to run and more tedious to write.


In [None]:
# BMI formula is weight/height ** 2

heights = [1.73, 1.68, 1.71, 1.89, 1.79]
weights = [65.4, 59.2, 63.6, 88.4, 68.7]

# find BMI

np_heights = np.array(heights)
np_weights = np.array(weights)

bmi = np_weights / np_heights ** 2

print(bmi)

In [None]:
np_heights.tolist()

In [None]:
result = []
for i in range(0,len(heights)):
    bmi = weights[i]/heights[i]**2
    result.append(bmi)
    
print(result)
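
As a sanity check, the loop-based and array-based calculations should give the same numbers. This is a minimal sketch that reuses the heights and weights from above; `bmi_loop` and `bmi_vectorized` are names chosen for this example.

```python
import numpy as np

heights = [1.73, 1.68, 1.71, 1.89, 1.79]
weights = [65.4, 59.2, 63.6, 88.4, 68.7]

# Loop version: one BMI at a time
bmi_loop = [w / h ** 2 for w, h in zip(weights, heights)]

# Array version: one expression for all family members
bmi_vectorized = np.array(weights) / np.array(heights) ** 2

# np.allclose compares floating point values within a small tolerance
print(np.allclose(bmi_loop, bmi_vectorized))
```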

Let's practice with numpy array operations:

In [None]:
import numpy as np
no_of_seats_per_class = [100, 80, 60, 90]  # list of seats available per class
no_of_students_per_class = [82, 34, 49, 88]  # list of students attended per class

# find no. of seats vacant per class using numpy array operations
# START HERE
num_seats = np.array(no_of_seats_per_class)
num_students = np.array(no_of_students_per_class)
print(num_seats-num_students)

**Yes, NumPy makes element-wise operations on sequences shorter to write and faster to run.**

Keep in mind that arithmetic operators behave differently on lists and NumPy arrays. For example:

In [None]:
py_list = [1, 2, 3]
print("sum of lists:", py_list + py_list)   # Results in a new list of 6 elements

np_list = np.array(py_list)
print("sum of np arrays:", np_list + np_list)  # Results in an array of 3 elements, each the sum of the elements at the same index
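
The `*` operator shows the same contrast between lists and arrays. A short sketch, with `repeated` and `doubled` as illustrative names:

```python
import numpy as np

py_list = [1, 2, 3]
np_list = np.array(py_list)

repeated = py_list * 2   # list repetition: the list is repeated twice
doubled = np_list * 2    # element-wise: every element is multiplied by 2

print(repeated)
print(doubled)
```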


#### Numpy Subsetting 


**slicing** 


Basic slicing is an extension of Python's basic concept of slicing to n dimensions. A Python slice object is constructed by giving a **start**, **stop**, and **step** parameters to the **built-in slice function**. This slice object is passed to the array to extract a part of an array.

```python
sliceObject = slice(start, stop, step)

arrayObject[sliceObject]
```

Alternatively, we can use the slicing parameters separated by colons.

```python
arrayObject[start:stop:step]
```


In [None]:
arrayObject = np.arange(1, 20)  # arange method will generate values within the interval. syntax arange(start,stop,step)
print("Original array:", arrayObject)
sliceObject = slice(2, 10, 4)

sliceArray = arrayObject[sliceObject]
print("Using slice Object: ", sliceArray)

paramArray = arrayObject[2:10:4]
print("Using slice parameters: ", paramArray)
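
Since basic slicing extends to n dimensions, here is a hedged sketch of slicing a 2-D array, with one slice per axis separated by a comma. The array `grid` and the result names are made up for this example.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0..11

first_two_rows = grid[:2]        # rows 0 and 1, all columns
every_other_col = grid[:, ::2]   # all rows, columns 0 and 2
block = grid[1:, 1:3]            # rows 1 and 2, columns 1 and 2

print(first_two_rows)
print(every_other_col)
print(block)
```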


**Using array of booleans**

Applying a comparison to a NumPy array produces an array of booleans, one per element. Placing that boolean array inside square brackets filters the original array, as below.
```python
# any element-wise comparison on arrayObject works here, for example:
booleanArray = arrayObject > 30
filteredObject = arrayObject[booleanArray]
```

Here, we are creating a new array from the result of data comparison.

In [None]:
import numpy as np
arrayObject = np.arange(0, 100, 10)
print("arrayObject:", arrayObject)

booleanArray = arrayObject > 30  # this conditional statement produces the array of booleans
print("booleanArray:", booleanArray)

filteredObject = arrayObject[booleanArray]
print("filteredObject:", filteredObject)
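
Several conditions can be combined into one boolean array. This is a small sketch rather than a complete reference: it uses `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses. The result names are illustrative.

```python
import numpy as np

arrayObject = np.arange(0, 100, 10)   # 0, 10, ..., 90

# Python's `and` / `or` do not work element-wise, so use & and | instead
between = arrayObject[(arrayObject > 30) & (arrayObject < 80)]
outside = arrayObject[(arrayObject < 20) | (arrayObject > 70)]
not_small = arrayObject[~(arrayObject < 50)]

print(between)
print(outside)
print(not_small)
```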

Let's practice with numpy array of booleans:

In [None]:
import numpy as np
list_of_class = ["Politics", "Engineering", "Biology", "Maths"]
no_of_seats_per_class = [100, 80, 60, 90]  # list of seats available per class
no_of_students_per_class = [82, 80, 49, 90]  # list of students attended per class

# print class names with full attendance using numpy array operations and array of booleans:
# START HERE
a = np.array(list_of_class)
b = np.array(no_of_seats_per_class)
c = np.array(no_of_students_per_class)

booleanArray = b == c
print(a[booleanArray])

#### Numpy library

**Commonly used methods** 

**arrayObject.shape**

Returns the tuple of array dimensions

```python
arrayObject.shape

print(np.array([[8, 9], [9, 2]]).shape)  # Prints: (2, 2)
```

**np.zeros**

Create an array of all zeros
   
```python
np.zeros(shape, dtype, order)
```

**np.ones**

Create an array of all ones
   
```python
np.ones(shape, dtype, order)
```

**np.full**

Create an array of all the same value
   
```python
np.full(shape, fill_value, dtype, order)
```

**np.eye**

Return a 2-D array with ones on the diagonal and zeros elsewhere.
   
```python
np.eye(n_rows, n_columns, diagonal_index, dtype, order)
```

**np.amin() and np.amax()**

These functions return the minimum and the maximum from the elements in the given array along the specified axis.

```python
np.amin(arrayObject)

np.amax(arrayObject)
```

**np.percentile()**

Percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. The function numpy.percentile() takes the following arguments.

```python
np.percentile(arrayObject, q, axis)
```

> - arrayObject: the input array
> - q: the percentile to compute, which must be between 0 and 100
> - axis: the axis along which the percentile is calculated

**np.median()**

Median is defined as the value separating the higher half of a data sample from the lower half. The numpy.median() function is used as shown in the following program.

```python
np.median(arrayObject, axis)
```

**np.mean()**

Arithmetic mean is the sum of elements along an axis divided by the number of elements. The numpy.mean() function returns the arithmetic mean of elements in the array. If the axis is mentioned, it is calculated along it.

```python
np.mean(arrayObject, axis)
```

**np.average()**

Weighted average is an average resulting from the multiplication of each component by a factor reflecting its importance. The numpy.average() function computes the weighted average of elements in an array according to their respective weight given in another array. The function can have an axis parameter. If the axis is not specified, the array is flattened.

Considering an array [1,2,3,4] and corresponding weights [4,3,2,1], the weighted average is calculated by adding the product of the corresponding elements and dividing the sum by the sum of weights.


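```python
weighted_average = (1*4 + 2*3 + 3*2 + 4*1) / (4 + 3 + 2 + 1)
# weights must be passed by keyword: the second positional argument of np.average is axis
np.average(arrayObject, weights=weights)
```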
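
The worked example with [1,2,3,4] and weights [4,3,2,1] can be checked in code. This is a sketch; `manual`, `with_numpy` and `plain_mean` are names chosen for illustration.

```python
import numpy as np

arrayObject = np.array([1, 2, 3, 4])
weights = np.array([4, 3, 2, 1])

# Sum of products divided by sum of weights: 20 / 10
manual = (arrayObject * weights).sum() / weights.sum()

# The same calculation using np.average with the weights keyword
with_numpy = np.average(arrayObject, weights=weights)

# For comparison, the unweighted mean: 10 / 4
plain_mean = np.mean(arrayObject)

print(manual, with_numpy, plain_mean)   # 2.0 2.0 2.5
```

The weighted average is lower than the plain mean here because the largest weights sit on the smallest values.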

**Standard Deviation**

Standard deviation is the square root of the average of squared deviations from the mean. The formula for standard deviation is as follows:

```python
std = sqrt(mean(abs(x - x.mean())**2))
np.std(arrayObject, axis)
```

**Variance**

Variance is the average of squared deviations. In other words, the standard deviation is the square root of variance.


```python
variance = mean(abs(x - x.mean())**2)
np.var(arrayObject, axis)
```
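
To connect the formulas with the NumPy functions, here is a small sketch on made-up data (`x` is an illustrative array, not part of the lesson data). By default `np.std` and `np.var` divide by the number of elements; passing `ddof=1` divides by one less, which gives the sample version.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # mean is 5

# Formulas written out by hand
variance_manual = np.mean(np.abs(x - x.mean()) ** 2)
std_manual = np.sqrt(variance_manual)

print(variance_manual, np.var(x))   # 4.0 4.0
print(std_manual, np.std(x))        # 2.0 2.0

# Sample standard deviation divides by n - 1 instead of n
sample_std = np.std(x, ddof=1)
print(sample_std)
```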


In [None]:
import numpy as np
c = np.zeros((3, 3), int)  # np.int was removed from recent NumPy versions; the built-in int works
print("\nCalling zeros() with shape(3,3):\n", c)

In [None]:
d = np.ones((3, 3), int)
print("\nCalling ones() with shape(3,3):\n", d)

In [None]:
e = np.full((3, 3), 9, int)
print("\nCalling full() with shape(3,3) and fill value 9:\n", e)

In [None]:
f = np.eye(3, 3, 0, int)
print("\nCalling eye():\n", f)

In [None]:
values = np.array([[30, 40, 70], [80, 20, 10], [50, 90, 60]]) 

print('\nOriginal Array:\n', values)

In [None]:
### amin and amax
print('\nApplying amin() function:')
print(np.amin(values), '\n')

print('Applying amax() function:') 
print(np.amax(values), '\n')

print('Applying amax() function with axis = 0:')
print(np.amax(values, axis=0), '\n')
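
The `axis` argument is easy to mix up, so here is a hedged sketch on the same `values` array: `axis=0` collapses the rows and gives one result per column, while `axis=1` collapses the columns and gives one result per row. `col_max` and `row_max` are illustrative names.

```python
import numpy as np

values = np.array([[30, 40, 70], [80, 20, 10], [50, 90, 60]])

col_max = np.amax(values, axis=0)   # maximum of each column
row_max = np.amax(values, axis=1)   # maximum of each row

print(col_max)   # [80 90 70]
print(row_max)   # [70 80 90]
```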

In [None]:
### Percentile
print('Applying percentile() function:')
print(np.percentile(values, 50), '\n')

print('Applying percentile() function along axis 1:')
print(np.percentile(values, 50, axis=1), '\n')

In [None]:
### Median
print('Applying median() function:')
print(np.median(values), "\n") 

print('Applying median() function along axis 0:')
print(np.median(values, axis=0), '\n') 

In [None]:
### Mean
print('Applying mean() function:') 
print(np.mean(values), '\n') 

print('Applying mean() function along axis 0:')
print(np.mean(values, axis=0), '\n')

In [None]:
### Average
print('Applying average() function:') 
print(np.average(values), '\n') 

In [None]:
print(np.average(values)) 
print()

In [None]:
### Standard Deviation
print('Applying std() function:') 
print(np.std(values), '\n') 

In [None]:
### Variance
print('Applying var() function:') 
print(np.var(values), '\n') 

### Practice

In [None]:
# below are the ticket fares of the first 10 passengers of each class
first_class = [71.2833,53.1,51.8625,26.55,35.5,263,27.7208,146.5208,82.1708,52]
second_class = [30.0708,16,13,26,13,10.5,21,41.5792,26,10.5]
third_class = [7.25,7.925,8.05,8.4583,21.075,11.1333,16.7,8.05,31.275,7.8542]

# Q1: Find the max fare of first class passengers
f = np.array(first_class)
s = np.array(second_class)
t = np.array(third_class)

for i in [f,s,t]:
    print(np.amax(i))

# dictionary mapping fare class to list of fares
ticket_class = {'1st Class' : first_class,
                '2nd Class' : second_class,
                '3rd Class' : third_class}

# Q2: Print the median, 95th percentile and standard deviation for each ticket class in the format below: 
# '<ticket class>': median = ___, 95th pct = ___, std dev = ___
# use of dictionary is up to you
# hint 1: convert each list to a numpy array and apply the median, percentile and std functions
# hint 2: for a dictionary called d, d.keys() returns the keys of that dictionary,
# in case you'd like to use a for loop.


for x in ticket_class.keys():
    array = np.array(ticket_class[x])
    median = np.median(array)
    pct = np.percentile(array, 95)
    std = np.std(array)
    print(x, ": median =", median.round(1), ", 95th pct =", pct.round(2), ", std dev = ", std.round(0))
    

## Back to Pandas

What *is* Pandas? [Pandas](https://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. Though you might have been thinking about adorable black and white pandas, this name was actually derived from the term *"panel data"*, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Pandas is often used together with [Numpy](http://www.numpy.org/) and [scikit-learn](http://scikit-learn.org/stable/index.html).

In [None]:
# let's import pandas library
import pandas as pd

### Pandas Data structure - Series

A Series is a **one-dimensional** array with labeled axes.

In [None]:
# read data into series
fruits = pd.Series(["Apple", "Banana", "Mango"])

If you print it, you'll see the index of the Series in the very first "column". You can then access a particular row with this index.

In [None]:
print(fruits)

In [None]:
fruits[2]    # returns 'Mango'

In [None]:
# You can pass in index to create your own index
fruits = pd.Series(["Apple", "Banana", "Mango"], index=['a', 'b', 'm'])
print(fruits)
fruits['m']  # returns 'Mango'

In [None]:
# Do this again, this time as a dictionary
fruits = pd.Series({'a': "Apple", 'b': "Banana", 'm': "Mango"})
print(fruits)
fruits['m']  # returns 'Mango'
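
When a Series has custom labels, it can help to say explicitly whether you mean a label or a position. A minimal sketch using `.loc` (label) and `.iloc` (position); the result names are made up for this example.

```python
import pandas as pd

fruits = pd.Series(["Apple", "Banana", "Mango"], index=['a', 'b', 'm'])

by_label = fruits.loc['m']         # look up by index label
by_position = fruits.iloc[2]       # look up by position (0-based)
several = fruits.loc[['a', 'm']]   # a list of labels returns a Series

print(by_label, by_position)
print(several)
```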

### Pandas Data structure - DataFrame

A DataFrame is a **2-dimensional** table with labeled axes. It acts like a dict-like container for Series objects.

The easiest way to create one is to pass a dictionary to `pd.DataFrame`, with each key as a column header and a corresponding list as that column's data. The lists should all be of the same length.

In [None]:
# Let's build a DataFrame of fruit inventory across all floors
fruit_inventory = {
    "fruit": ["Apple", "Banana", "Mango"],
    "fruit_count_2F": [75, 150, 250],
    "fruit_count_6F": [80, 120, 150],
    "fruit_count_8F": [50, 100, 200],
    "fruit_count_9F": [100, 200, 350],    
  }
df1 = pd.DataFrame(fruit_inventory)
df1

In [None]:
# You get a Series if you access a single column of a DataFrame
df1.fruit # you can also use df1["fruit"]

In [None]:
df1['fruit']

In [None]:
# let's create a second dataframe
fruit_property = {
    "fruit": ["Apple", "Banana", "Mango", "Papaya"],
    "fruit_color": ["red", "yellow", "yellow", "orange"],
  }
df2 = pd.DataFrame(fruit_property)
df2

#### Merging DataFrames

In [None]:
# Left merge DataFrames df1 and df2 using the 'fruit' column as key 
# This only retains rows which exist in df1
print(df1)
print(df2)
fruit_list1 = df1.merge(df2, on='fruit', how='left')
fruit_list1

In [None]:
# Outer merge DataFrames df1 and df2 using the 'fruit' column as key 
# All rows are kept, and empty values are filled with NaN (Not a Number)

fruit_list2 = df1.merge(df2, on="fruit", how="outer")
fruit_list2
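
To see where each row of a merge came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses cut-down copies of the two DataFrames (only one count column), so treat it as an illustration rather than the lesson's exact data.

```python
import pandas as pd

df1 = pd.DataFrame({"fruit": ["Apple", "Banana", "Mango"],
                    "fruit_count_2F": [75, 150, 250]})
df2 = pd.DataFrame({"fruit": ["Apple", "Banana", "Mango", "Papaya"],
                    "fruit_color": ["red", "yellow", "yellow", "orange"]})

# Inner merge keeps only fruits present in both DataFrames
inner = df1.merge(df2, on="fruit", how="inner")

# Outer merge keeps everything; _merge says 'both', 'left_only' or 'right_only'
outer = df1.merge(df2, on="fruit", how="outer", indicator=True)

print(inner)
print(outer)
```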

#### Removing NaN

In [None]:
# Let's replace NaN value in fruit count columns with 0
fruit_list2 = fruit_list2.fillna(0)
fruit_list2
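
`fillna(0)` fills every column with the same value. If different columns need different replacements, `fillna` also accepts a dictionary mapping column names to fill values. The DataFrame below is hypothetical (the missing colour is invented for this example).

```python
import numpy as np
import pandas as pd

inventory = pd.DataFrame({
    "fruit": ["Mango", "Papaya"],
    "fruit_count_2F": [250, np.nan],
    "fruit_color": ["yellow", None],
})

# Count the missing values in each column before filling
missing_per_column = inventory.isna().sum()
print(missing_per_column)

# Fill counts with 0 and colours with a placeholder string
filled = inventory.fillna({"fruit_count_2F": 0, "fruit_color": "unknown"})
print(filled)
```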

#### Summing by row in DataFrame

In [None]:
# Now let's create a column to sum the total fruit count across all floors

# data.loc[<row selection>, <column selection>]
# In this case, we're applying across all rows for a selected column range
# axis=1 means row-wise, while axis=0 means column-wise

fruit_list2['fruit_count_total']= fruit_list2.loc[:, 'fruit_count_2F':'fruit_count_9F'].sum(axis=1)
fruit_list2

#### Summing by column in DataFrame

In [None]:
list(fruit_list2)

In [None]:
print(fruit_list2['fruit_count_total'].sum())

# Let's count all the fruits on each floor

# First, create a list of column names which include 'F' 
columns = [column for column in list(fruit_list2) if 'F' in column]

# Next, sum within each column for the column names which include 'F'
fruit_count_floor_total = fruit_list2[columns].sum()
fruit_count_floor_total

#### Average 

In [None]:
# Finding the average of a particular column
fruit_list2['fruit_count_2F'].mean()

In [None]:
# Finding the average of a couple of columns, per row
# the default axis in the .mean() is 0. if we change it to 1, it will compute by the row instead of the column

fruit_list2['fruit_count_avg']= fruit_list2.loc[:, 'fruit_count_2F':'fruit_count_9F'].mean(axis=1)
fruit_list2.sort_values('fruit_count_avg', ascending=False)

#### Sorting rows in DataFrame

In [None]:
# Great! Now let's sort the fruit count total in descending order
fruit_list2.sort_values('fruit_count_total', ascending=False)

In [None]:
# sort and display in descending order, only fruits with count exceeding 500 

fruit_list2[fruit_list2['fruit_count_total']>500].sort_values('fruit_count_total', ascending=False)

#### Counting the number of occurrences in a particular column

In [None]:
fruit_list2['fruit_color'].value_counts()

#### Retrieving certain records in DataFrame

In [None]:
# Let's retrieve rows containing only yellow fruits
fruit_list2.loc[fruit_list2['fruit_color'] == "yellow"]

In [None]:
# Get unique fruit colors
print(fruit_list2['fruit_color'].unique())

#### Grouping and summing in DataFrames

In [None]:
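# Group fruit count by fruit color
# numeric_only=True sums only the numeric columns and skips text columns such as 'fruit'
fruit_list2.groupby('fruit_color').sum(numeric_only=True)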
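
If you only need one column from the grouped result, you can select it before aggregating. A small sketch on a standalone DataFrame (`fruits` here is a simplified stand-in, with totals taken from the inventory above):

```python
import pandas as pd

fruits = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Mango"],
    "fruit_color": ["red", "yellow", "yellow"],
    "fruit_count_total": [305, 570, 950],
})

# Select the column after groupby, then aggregate: the result is a Series
totals_by_color = fruits.groupby("fruit_color")["fruit_count_total"].sum()
print(totals_by_color)
```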

#### Adding and deleting columns in DataFrames

In [None]:
# Add a new column 'in_stock' based on function applied to column 'fruit_count_total'
# 'in_stock' = True only if fruit_count_total > 0
# apply lets us apply our own customized function to the data

fruit_list2['in_stock'] = fruit_list2['fruit_count_total'].apply(lambda x: True if x > 0 else False)
fruit_list2

In [None]:
# Let's delete 'in_stock' column
del fruit_list2['in_stock']
fruit_list2

### Wait... What the f*** is lambda??????

Lambda is not unique to Python (many languages have anonymous functions), but the syntax may be new to you. It is **simply another way of writing functions** in Python _(WHY THEY MAKE LIFE SO TOUGH?!)_. 

We generally use lambda functions when we require a nameless function for a short period of time. Let's see..

In [None]:
# Let's say you want to calculate the cube of a number..
# Normal function method
def cube_normal(number):
    return number**3

print(cube_normal(4))

# Lambda method
cube_lambda = lambda number: number**3

print(cube_lambda(4))

In [None]:
# Let's try to do what we did just now with both methods..
# Using a normal function
def greater_than_0(x):
    if x >0:
        out = True
    else: out = False
    return out

fruit_list2['in_stock'] = fruit_list2['fruit_count_total'].apply(greater_than_0)
print(fruit_list2['in_stock'])

# Using Lambda
fruit_list2['in_stock'] = fruit_list2['fruit_count_total'].apply(lambda x: True if x > 0 else False)
print(fruit_list2['in_stock'])

The difference is simply that in the first way, you need to define a function before doing the task. This function stays in your workspace until you clear it. In the second way, the function is created on the fly, without the need to define it first. As you can see, its use is really just as a nameless function that we probably only need once.

I can't cover everything, but do read more [here](https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/)!
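
Lambdas are not tied to pandas. A common place to meet them is the `key` argument of `sorted`. The sketch below uses a made-up list of `(fruit, count)` pairs.

```python
fruit_counts = [("Apple", 305), ("Banana", 570), ("Mango", 950)]

# Sort by the second item of each pair, largest first
by_count_desc = sorted(fruit_counts, key=lambda pair: pair[1], reverse=True)
print(by_count_desc)

# Apply a one-off transformation to every item
names = list(map(lambda pair: pair[0].upper(), fruit_counts))
print(names)
```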

#### Aggregating figures

In [None]:
# Calculate the sum of total fruit count
print(fruit_list2.agg({'fruit_count_total':'sum'}))

# Calculate the sum and median of total fruit count
print(fruit_list2.agg({'fruit_count_total':['sum','median']}))

# Calculate the max minus the min of fruit_count_2F & fruit_count_6F
print(fruit_list2.agg({'fruit_count_2F' : lambda x: max(x) - min(x),
                      'fruit_count_6F' : lambda x: max(x) - min(x)}))

### Now it's your turn to use Pandas to explore the Titanic data

In [None]:
# Ready? Let's read the Titanic csv data into a dataframe 
import pandas as pd
df = pd.read_csv("../data/titanic.csv")

#### Retrieve top records in DataFrame

In [None]:
df.head(5)

#### Retrieve bottom records in DataFrame

In [None]:
# How about the bottom 3 rows?
df.tail(3)

### Now you have everything you need to code the following by yourselves!

In [None]:
# Your turn now! Let's print unique ports of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
df['Embarked'].unique()

In [None]:
# Try retrieving data for passengers who embarked from Southampton
df[df["Embarked"] == "S"]
df.loc[df["Embarked"] == "S"]

# The above 2 methods give the same result, but using .loc is more explicit. I can't quite explain this fully,
# but would recommend that you use .loc


What is the difference between `df['Embarked']` & `df.Embarked`?

The `.` is only there for convenience, you can access a pre-existing column _(provided the column name is appropriate - does not have ".", " ", etc.)_ but you cannot create a new column with this notation, since the column does not exist yet.

Also, what's the difference between `df[df.Embarked == "S"]` & `df.loc[df.Embarked == "S"]`?

If you are selecting a single column, a list of columns, or a slice of rows then there is no difference. However, `[]` does not allow you to select a single row, a list of rows or a slice of columns. More importantly, if your selection involves both rows and columns, then assignment becomes problematic. 

`df[1:3]['A'] = 5`

This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy (_please think of this as `inplace = True` vs `inplace = False`_), so this may not change the actual DataFrame. This raises SettingWithCopyWarning. The correct way to write this assignment is

`df.loc[1:3, 'A'] = 5`

With `.loc`, you are guaranteed to modify the original DataFrame. Note that `.loc` slices by label and includes the end label, so `1:3` here covers labels 1, 2 and 3. It also allows you to slice columns (`df.loc[:, 'C':'F']`), select a single row (`df.loc[5]`), and select a list of rows (`df.loc[[1, 2, 5]]`). 

Answer from [stackoverflow](https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte)
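
A minimal sketch of the `.loc` assignment on a made-up DataFrame (the columns `A` and `B` and their values are purely illustrative). It also shows that `[]` slices rows by position while `.loc` slices by label and includes the end label.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 0, 0, 0], "B": [10, 20, 30, 40, 50]})

# Assign through .loc: rows labelled 1 to 3 (inclusive), column 'A'
df.loc[1:3, "A"] = 5
print(df)

# Positional slice with [] stops before position 3
rows_by_position = df[1:3]
# Label slice with .loc includes label 3
rows_by_label = df.loc[1:3]

print(len(rows_by_position), len(rows_by_label))   # 2 3
```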





In [None]:
# Find the overall number of males and females on Titanic
df['Sex'].value_counts()

In [None]:
# Now add a column called "family_members" which takes the sum of the number of siblings/spouse & parents/children

# Method 1 (simpler..)
df['family_members'] = df.SibSp + df.Parch

# Method 2 (why u do diz)
df['family_members'] = df.loc[:,["SibSp", "Parch"]].sum(axis = 1)
df.head()

In [None]:
# delete the columns SibSp & Parch

# axis = 1 for column wise dropping, inplace to make the change permanent
df.drop(['SibSp','Parch'], axis = 1, inplace =True)

In [None]:
# sort and display in descending order, passengers with the most family members on the ship
df.sort_values('family_members',ascending = False).head()

In [None]:
# calculate the overall survival rate
# hint: use the .agg({}) and write a lambda function on 'Survived'. Take sum()/len() to get the rate

# sum(x) returns the sum of the column 'Survived', while len(x) returns the length of the column 'Survived'
df.agg({'Survived':lambda x: sum(x)/len(x)})

In [None]:
# calculate the survival rate by ticket class

df.groupby('Pclass').agg({'Survived':lambda x: sum(x)/len(x)})

In [None]:
# calculate the survival rate by ticket class and by sex

df.groupby(['Pclass','Sex']).agg({'Survived':lambda x: sum(x)/len(x)})
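
Assuming `Survived` is coded as 0 and 1, `sum(x)/len(x)` is the same as the mean, so `.mean()` gives the rate directly. The tiny `passengers` DataFrame below is invented for illustration and is not the Titanic data.

```python
import pandas as pd

passengers = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Overall rate: 3 survivors out of 6 passengers
overall_rate = passengers["Survived"].mean()

# Rate per ticket class
rate_by_class = passengers.groupby("Pclass")["Survived"].mean()

# Same result with the lambda used in the exercise
rate_by_class_lambda = passengers.groupby("Pclass").agg({"Survived": lambda x: sum(x) / len(x)})

print(overall_rate)
print(rate_by_class)
```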

In [None]:
# calculate the median fare by ticket class, for male passengers

# Method 1
print(df[df['Sex'] == "male"].groupby('Pclass')['Fare'].median())

# Method 2
print(df[df['Sex'] == "male"].groupby('Pclass').agg({'Fare':['median', 'sum', 'count']}))

In [None]:
# calculate the survival rate by port of embarkation, for passengers aged 30 and above

df[df['Age'] >= 30].groupby('Embarked').agg({'Survived':lambda x: sum(x)/len(x)})

## With that, we have come to the end of our very short introduction to Python! 
## **Better news: NO HW TODAY!**

I hope you've learnt a lot from this and find more opportunities to use and practice it! The journey to using Python at work has only just started and we are only just at the tip of the tip of the iceberg :D

Topics to work on in future:
- Managing time series data (datetime)
- Plotting (matplotlib, seaborn, plotly, bokeh)
- Advanced data wrangling (pandas, data.table (coming soon!))
- Natural language processing (gensim, nltk)
- Machine learning & AI techniques (scikit-learn, tensorflow, pytorch)