## Lesson 4 Overview

1. NumPy
2. Pandas

## Let's load today's lesson!

### Open Azure Notebooks library 

Go to https://notebooks.azure.com -> Sign in if needed -> Select **python-intro**

### Update lesson file to latest version

Select **New** -> **From URL** -> input https://raw.githubusercontent.com/kelvnt/python-intro/master/Lesson4.ipynb (URL is available in **Lesson4.ipynb**) -> Click outside input then select **Upload** (overwrite if needed)

### Open Jupyter lab

From your browser's bookmark or **Run** -> Change browser URL path from **/nb/tree** to **/nb/lab**

Select **Lesson4.ipynb**

### NumPy

NumPy, which stands for Numerical Python, is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. One can perform mathematical and logical operations on arrays more efficiently and effortlessly. 

**How to import the NumPy package into your Python program:**
```python
import numpy as np
```
#### Numpy Arrays:

A NumPy array is a container type provided by the NumPy library. Like a Python list, it holds an ordered collection of values, and it can be created from a Python list object.

```python
np.array(listObject)
```

In [2]:
import numpy as np
py_list = [1, 3, 10, 50]
np_list = np.array(py_list)
print(np_list)

NumPy assumes that all values in an array share a single type, such as booleans or integers. Creating a NumPy array from values of different types converts them all to one common type, which is the string type in the case below.

In [18]:
import numpy as np
py_list = [1, 'alice', True] 
np_list = np.array(py_list)  # Numpy arrays contain only one type
print(np_list)               # outputs: ['1' 'alice' 'True']

**What is so special about NumPy arrays?**

Suppose you have lists of the heights and weights of family members (as below) and are asked to calculate the BMI of each person. With plain lists, you have to iterate over every element yourself, which is slower to run and more tedious to write.
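
For comparison, here is a minimal sketch of the plain-list approach, using the same heights and weights as the NumPy cell that follows. The name `bmi_list` is only illustrative.

```python
# Plain-Python version of the BMI calculation, for comparison with NumPy.
heights = [1.73, 1.68, 1.71, 1.89, 1.79]  # metres
weights = [65.4, 59.2, 63.6, 88.4, 68.7]  # kilograms

# Lists do not support element-wise division, so each pair is handled in a loop.
bmi_list = []
for height, weight in zip(heights, weights):
    bmi_list.append(weight / height ** 2)

print(bmi_list)
```

The loop produces the same numbers as the array expression `np_weights / np_heights ** 2`, but every step has to be written out by hand.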


In [15]:
# BMI formula is weight/height ** 2

heights = [1.73, 1.68, 1.71, 1.89, 1.79]
weights = [65.4, 59.2, 63.6, 88.4, 68.7]

# find BMI

np_heights = np.array(heights)
np_weights = np.array(weights)

bmi = np_weights / np_heights ** 2

print(bmi)

Let's practice with numpy array operations:

In [12]:
import numpy as np
no_of_seats_per_class = [100, 80, 60, 90]  # list of seats available per class
no_of_students_per_class = [82, 34, 49, 88]  # list of students attended per class

# find no. of seats vacant per class using numpy array operations
# START HERE

**Yes, Numpy makes list operations less expensive and more efficient.** 

Keep in mind that arithmetic operators also behave differently on lists and NumPy arrays. For example:

In [5]:
py_list = [1, 2, 3]
print("sum of lists:", py_list + py_list)   # Results in a new list of 6 elements

np_list = np.array(py_list)
print("sum of np arrays:", np_list + np_list)  # Results in an array of 3 elements, each the sum of the elements at the same index


#### Numpy Subsetting 


**slicing** 


Basic slicing is an extension of Python's basic concept of slicing to n dimensions. A Python slice object is constructed by giving **start**, **stop**, and **step** parameters to the **built-in slice function**. This slice object is passed to the array to extract a part of it.

```python
sliceObject = slice(start, stop, step)

arrayObject[sliceObject]
```

Alternatively, we can use the slicing parameters separated by colons.

```python
arrayObject[start:stop:step]
```
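
To illustrate the "n dimensions" part, here is a small sketch on a two-dimensional array, where each axis gets its own slice separated by a comma. The names `grid`, `sub` and `same` are made up for this example.

```python
import numpy as np

# A 3x4 grid holding the values 0..11, used only for illustration.
grid = np.arange(12).reshape(3, 4)

# One slice per axis: rows 0 and 1, then every second column.
sub = grid[0:2, ::2]
print(sub)

# The same selection written with explicit slice objects.
same = grid[slice(0, 2), slice(None, None, 2)]
print(np.array_equal(sub, same))
```

Passing `None` to `slice` plays the role of leaving a position empty in the colon notation.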


In [7]:
arrayObject = np.arange(1, 20)  # arange method will generate values within the interval. syntax arange(start,stop,step)
print("Original array:", arrayObject)
sliceObject = slice(2, 10, 4)

sliceArray = arrayObject[sliceObject]
print("Using slice Object: ", sliceArray)

paramArray = arrayObject[2:10:4]
print("Using slice parameters: ", paramArray)


**Using array of booleans**

Applying a comparison operator to a NumPy array produces an array of booleans, one for each index. Placing that boolean array inside square brackets filters the original array, keeping only the elements where the boolean is True, as below.

```python
booleanArray = arrayObject > 30             # any element-wise comparison on arrayObject
filteredObject = arrayObject[booleanArray]  # keeps the elements where booleanArray is True
```

Here, we are creating a new array from the result of data comparison.

In [31]:
import numpy as np
arrayObject = np.arange(0, 100, 10)
print("arrayObject:", arrayObject)

booleanArray = arrayObject > 30  # this conditional statement produces the array of booleans
print("booleanArray:", booleanArray)

filteredObject = arrayObject[booleanArray]
print("filteredObject:", filteredObject)

Let's practice with numpy array of booleans:

In [11]:
import numpy as np
list_of_class = ["Politics", "Engineering", "Biology", "Maths"]
no_of_seats_per_class = [100, 80, 60, 90]  # list of seats available per class
no_of_students_per_class = [82, 80, 49, 90]  # list of students attended per class

# print class names with full attendance using numpy array operations and array of booleans:
# START HERE

#### Numpy library

**Commonly used methods** 

**arrayObject.shape**

Returns the tuple of array dimensions

```python
arrayObject.shape

print(np.array([[8, 9], [9, 2]]).shape)  # Prints: (2, 2)
```

**np.zeros**

Create an array of all zeros
   
```python
np.zeros(shape, dtype, order)
```

**np.ones**

Create an array of all ones
   
```python
np.ones(shape, dtype, order)
```

**np.full**

Create an array of all the same value
   
```python
np.full(shape, fill_value, dtype, order)
```

**np.eye**

Return a 2-D array with ones on the diagonal and zeros elsewhere.
   
```python
np.eye(n_rows, n_columns, diagonal_index, dtype, order)
```
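
A short sketch of how the diagonal index changes the result. In NumPy's own signature the keyword for the diagonal index is `k`; the variable names here are only illustrative.

```python
import numpy as np

main_diagonal = np.eye(3, dtype=int)        # ones on the main diagonal
upper_diagonal = np.eye(3, k=1, dtype=int)  # ones one step above the main diagonal
rectangular = np.eye(2, 3, dtype=int)       # 2 rows and 3 columns

print(main_diagonal)
print(upper_diagonal)
print(rectangular)
```

A positive `k` moves the line of ones above the main diagonal, and a negative `k` moves it below.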

**np.amin() and np.amax()**

These functions return the minimum and the maximum from the elements in the given array along the specified axis.

```python
np.amin(arrayObject)

np.amax(arrayObject)
```

**np.percentile()**

Percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. The function numpy.percentile() takes the following arguments.

```python
np.percentile(arrayObject, q, axis)
```

>arrayObject : input array
>q    : the percentile to compute, which must be between 0 and 100
>axis : The axis along which the percentile is to be calculated

**np.median()**

Median is defined as the value separating the higher half of a data sample from the lower half. The numpy.median() function is used as shown in the following program.

```python
np.median(arrayObject, axis)
```

**np.mean()**

Arithmetic mean is the sum of elements along an axis divided by the number of elements. The numpy.mean() function returns the arithmetic mean of elements in the array. If the axis is mentioned, it is calculated along it.

```python
np.mean(arrayObject, axis)
```

**np.average()**

Weighted average is an average resulting from the multiplication of each component by a factor reflecting its importance. The numpy.average() function computes the weighted average of elements in an array according to their respective weight given in another array. The function can have an axis parameter. If the axis is not specified, the array is flattened.

Considering an array [1,2,3,4] and corresponding weights [4,3,2,1], the weighted average is calculated by adding the product of the corresponding elements and dividing the sum by the sum of weights.


```python
weighted_average = (1*4 + 2*3 + 3*2 + 4*1) / (4 + 3 + 2 + 1)
np.average(arrayObject, weights=weights)  # weights must be passed by keyword
```
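
The worked numbers above can be checked with a short sketch. Note that the second positional argument of `np.average` is `axis`, so the weights are passed with the `weights=` keyword. The names `manual` and `library` are only illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4])
weights = np.array([4, 3, 2, 1])

# Sum of element-wise products divided by the sum of the weights.
manual = (values * weights).sum() / weights.sum()

# The same calculation done by NumPy.
library = np.average(values, weights=weights)

print(manual, library)
```

Both expressions evaluate to 20 / 10, that is 2.0.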

**Standard Deviation**

Standard deviation is the square root of the average of squared deviations from the mean. The formula for standard deviation is as follows:

```python
std = sqrt(mean(abs(x - x.mean())**2))
np.std(arrayObject, axis)
```

**Variance**

Variance is the average of squared deviations. In other words, the standard deviation is the square root of variance.


```python
variance = mean(abs(x - x.mean())**2)
np.var(arrayObject, axis)
```
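
As a quick check of both formulas, this sketch computes them by hand on a small, made-up array and compares the results with `np.var` and `np.std`.

```python
import numpy as np

x = np.array([1, 2, 3, 4])  # sample data chosen for illustration

# Average of squared deviations from the mean.
variance = np.mean(np.abs(x - x.mean()) ** 2)

# Standard deviation is the square root of the variance.
std = np.sqrt(variance)

print(variance, np.var(x))
print(std, np.std(x))
```

With the default settings, both NumPy functions divide by the number of elements, which matches the formulas given here.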


In [34]:
import numpy as np
c = np.zeros((3, 3), int)  # np.int was removed from NumPy; the built-in int works instead
print("\nCalling zeros() with shape(3,3):\n", c)

d = np.ones((3, 3), int)
print("\nCalling ones() with shape(3,3):\n", d)

e = np.full((3, 3), 9, int)
print("\nCalling full() with shape(3,3) and fill value 9:\n", e)

f = np.eye(3, 3, 0, int)
print("\nCalling eye():\n", f)

values = np.array([[30, 40, 70], [80, 20, 10], [50, 90, 60]]) 

print('\nOriginal Array:\n', values)

### amin and amax
print('\nApplying amin() function:')
print(np.amin(values), '\n')

print('Applying amax() function:') 
print(np.amax(values), '\n')

print('Applying amax() function along axis 0:')
print(np.amax(values, axis=0), '\n')

### Percentile
print('Applying percentile() function:')
print(np.percentile(values, 50), '\n')

print('Applying percentile() function along axis 1:')
print(np.percentile(values, 50, axis=1), '\n')

### Median
print('Applying median() function:')
print(np.median(values), "\n") 

print('Applying median() function along axis 0:')
print(np.median(values, axis=0), '\n') 

### Mean
print('Applying mean() function:') 
print(np.mean(values), '\n') 

print('Applying mean() function along axis 0:')
print(np.mean(values, axis=0), '\n')

### Average
print('Applying average() function:') 
print(np.average(values), '\n') 

### Standard Deviation
print('Applying std() function:') 
print(np.std(values), '\n') 

### Variance
print('Applying var() function:') 
print(np.var(values), '\n') 

### Challenges (Optional)

In [13]:

p1_daily_steps = [11980, 10437, 17616, 24586, 16136, 13700, 39812, 9195, 12855, 11309, 23606, 11848, 6120, 6254, 8754, 6469, 8849, 9911, 7709, 534, 13465, 7341, 11230, 7878, 11029, 8790, 9006, 21942]
p2_daily_steps = [22935, 13399, 25098, 29581, 26121, 12805, 16073, 15124, 16011, 6198, 10026, 10909, 14468, 4828, 11207, 7133, 14977, 13746, 12267, 9364, 1061, 6075, 11188, 11472, 10150, 13023, 6769, 10165]
p3_daily_steps = [16272, 17231, 20595, 17047, 15216, 42590, 10969, 20687, 19170, 12703, 17192, 12865, 10960, 9105, 16019, 12646, 10042, 13353, 16072, 41673, 13425, 11262, 7801, 6666, 5276, 11353, 5344, 6282]
p4_daily_steps = [10233, 16120, 12897, 24680, 13060, 20489, 10230, 25565, 10029, 12696, 13938, 9475, 5297, 8573, 9857, 15341, 9482, 11649, 5804, 11080, 6245, 7611, 8401, 5596, 6491, 7637, 7610, 9130]
p5_daily_steps = [14126, 14110, 10111, 20440, 21416, 16989, 25371, 21539, 23045, 20043, 20328, 12058, 12004, 3301, 9789, 6671, 7893, 9589, 10459, 5091, 6329, 6784, 6543, 17984, 13588, 11077, 7856, 20897]

team = { # a dictionary mapping initial to list of daily steps
    'G': p1_daily_steps, 
    'I': p2_daily_steps, 
    'P': p3_daily_steps, 
    'T': p4_daily_steps, 
    'D': p5_daily_steps
}

# Print minimum and maximum no. of steps walked by each person in this format: 'G's (minimum, maximum) steps per day are: (534, 39812)'
# Hint 1: convert list to numpy array and apply amin and amax functions. 
# Hint 2: If a dictionary is called dict, dict.keys() will return the list of keys in that dictionary.

# Start here




# Print the person with the highest average and their total no. of steps
# Hint 1: use numpy's average() and sum() functions

# Start here




# Print the percentage increment between first two days and last two days of each team member
# Hint 1: (total_steps_in_first_2_days - total_steps_in_last_2_days)*100/total_steps
# Start here







# Homework
Solution can be found in Homework/Solutions/Lesson3.ipynb

In [None]:
# Easy mode
daily_steps = [11980, 10437, 17616, 24586, 16136, 13700, 39812, 9195, 12855, 11309, 23606, 11848, 6120, 6254, 8754, 6469, 8849, 9911, 7709, 534, 13465, 7341, 11230, 7878, 11029, 8790, 9006, 21942]

# Find the total number of steps completed by this participant

# Start here



# Find the average daily steps of this participant. Hint: len(list) will return the number of items inside the list (the length of the list)

# Start here

In [None]:
# Challenging mode
p1_daily_steps = [11980, 10437, 17616, 24586, 16136, 13700, 39812, 9195, 12855, 11309, 23606, 11848, 6120, 6254, 8754, 6469, 8849, 9911, 7709, 534, 13465, 7341, 11230, 7878, 11029, 8790, 9006, 21942]
p2_daily_steps = [22935, 13399, 25098, 29581, 26121, 12805, 16073, 15124, 16011, 6198, 10026, 10909, 14468, 4828, 11207, 7133, 14977, 13746, 12267, 9364, 1061, 6075, 11188, 11472, 10150, 13023, 6769, 10165]
p3_daily_steps = [16272, 17231, 20595, 17047, 15216, 42590, 10969, 20687, 19170, 12703, 17192, 12865, 10960, 9105, 16019, 12646, 10042, 13353, 16072, 41673, 13425, 11262, 7801, 6666, 5276, 11353, 5344, 6282]
p4_daily_steps = [10233, 16120, 12897, 24680, 13060, 20489, 10230, 25565, 10029, 12696, 13938, 9475, 5297, 8573, 9857, 15341, 9482, 11649, 5804, 11080, 6245, 7611, 8401, 5596, 6491, 7637, 7610, 9130]
p5_daily_steps = [14126, 14110, 10111, 20440, 21416, 16989, 25371, 21539, 23045, 20043, 20328, 12058, 12004, 3301, 9789, 6671, 7893, 9589, 10459, 5091, 6329, 6784, 6543, 17984, 13588, 11077, 7856, 20897]

team = { # a dictionary mapping initial to list of daily steps
    'G': p1_daily_steps, 
    'I': p2_daily_steps, 
    'P': p3_daily_steps, 
    'T': p4_daily_steps, 
    'D': p5_daily_steps
}

# Print the total steps of each person in this format: 'Total steps of G is 100000000'
# Hint 1: create a function to get total steps per person. 
# Hint 2: If a dictionary is called dict, dict.keys() will return the list of keys in that dictionary.

# Start here



# Print the total steps of the team

# Start here



# Print the initial of the person with total steps of more than 400k

# Start here


## Let's talk about Pandas

What *is* Pandas? [Pandas](https://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. Though you might have been thinking about adorable black and white pandas, this name was actually derived from the term *"panel data"*, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Pandas is often used together with [NumPy](http://www.numpy.org/) and [scikit-learn](http://scikit-learn.org/stable/index.html).

In [None]:
# let's import pandas library
import pandas as pd

### Pandas Data structure - Series

A Series is a **one-dimensional** array with labeled axes.

In [None]:
# read data into series where index starts from 0
fruits = pd.Series(["Apple", "Banana", "Mango"])
fruits[2]    # returns 'Mango'

In [None]:
# You can pass in index to create your own index
fruits = pd.Series(["Apple", "Banana", "Mango"], index=['a', 'b', 'm'])
fruits['m']  # returns 'Mango'

In [None]:
# Do this again, this time as a dictionary
fruits = pd.Series({'a': "Apple", 'b': "Banana", 'm': "Mango"})
fruits['m']  # returns 'Mango'

### Pandas Data structure - DataFrame

A DataFrame is a **2-dimensional** table with labeled axes. It acts as a dict-like container for Series objects, where each column is a Series.
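
One way to see the "container for Series" idea is to build a small DataFrame directly from Series objects. This is only a sketch; the names `colors`, `stock` and `table` are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels.
colors = pd.Series(["red", "yellow"], index=["Apple", "Banana"])
stock = pd.Series([75, 150], index=["Apple", "Banana"])

# Each dictionary key becomes a column name; each Series becomes a column.
table = pd.DataFrame({"color": colors, "stock": stock})

print(type(table["stock"]))          # looking up a column gives back a Series
print(table.loc["Banana", "stock"])  # row label, then column label
```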

In [None]:
# Let's build a DataFrame of fruit inventory across all floors
fruit_inventory = {
    "fruit": ["Apple", "Banana", "Mango"],
    "fruit_count_2F": [75, 150, 250],
    "fruit_count_6F": [80, 120, 150],
    "fruit_count_8F": [50, 100, 200],
    "fruit_count_9F": [100, 200, 350],    
  }
df1 = pd.DataFrame(fruit_inventory)
df1

In [None]:
# You get a Series if you access one of a DataFrame's columns
df1.fruit  # you can also use df1["fruit"]

In [None]:
# let's create a second dataframe
fruit_property = {
    "fruit": ["Apple", "Banana", "Mango", "Papaya"],
    "fruit_color": ["red", "yellow", "yellow", "orange"],
  }
df2 = pd.DataFrame(fruit_property)
df2

#### Merging DataFrames

In [None]:
# Left merge DataFrames df1 and df2 using the 'fruit' column as key 
# This only retains rows which exist in df1

fruit_list1 = df1.merge(df2, on='fruit', how='left')
fruit_list1

In [None]:
# Outer merge DataFrames df1 and df2 using the 'fruit' column as key 
# All rows are kept, and empty values are filled with NaN (Not a Number)

fruit_list2 = df1.merge(df2, on="fruit", how="outer")
fruit_list2
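
The difference between the two merge types is easiest to see on tiny tables. The sketch below uses made-up DataFrames named `left` and `right` rather than the lesson's data.

```python
import pandas as pd

left = pd.DataFrame({"fruit": ["Apple", "Banana"], "stock": [75, 150]})
right = pd.DataFrame({"fruit": ["Banana", "Papaya"], "color": ["yellow", "orange"]})

# Left merge: keeps every row of `left`; Apple has no match, so its color is NaN.
left_merge = left.merge(right, on="fruit", how="left")

# Outer merge: keeps rows from both sides; Papaya has no stock, so its stock is NaN.
outer_merge = left.merge(right, on="fruit", how="outer")

print(left_merge)
print(outer_merge)
```

The left merge has two rows and the outer merge has three, because Papaya only exists in the right-hand table.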

#### Removing NaN and casting numbers to integer

In [None]:
# Let's replace NaN value in fruit count columns with 0
fruit_list2 = fruit_list2.fillna(0)

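# Let's remove the decimal places that appeared in fruit count after merging DataFrames
# (the numeric columns are selected explicitly because errors='ignore' is deprecated in recent pandas)
count_columns = fruit_list2.select_dtypes(include='number').columns
fruit_list2[count_columns] = fruit_list2[count_columns].astype(int)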

fruit_list2

#### Summing by row in DataFrame

In [None]:
# Now let's create a column to sum the total fruit count across all floors

# data.loc[<row selection>, <column selection>]
# In this case, we're applying across all rows for a selected column range
# axis=1 adds up the columns within each row (one total per row),
# while axis=0 adds up the rows within each column (one total per column)

fruit_list2['fruit_count_total'] = fruit_list2.loc[:, 'fruit_count_2F':'fruit_count_9F'].sum(axis=1)
fruit_list2
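
A minimal sketch of the two `axis` values on a made-up table, to show which direction each one sums in. The names `table`, `per_fruit` and `per_floor` are illustrative only.

```python
import pandas as pd

table = pd.DataFrame(
    {"2F": [1, 2, 3], "6F": [10, 20, 30]},
    index=["Apple", "Banana", "Mango"],
)

per_fruit = table.sum(axis=1)  # one total per row (per fruit)
per_floor = table.sum(axis=0)  # one total per column (per floor)

print(per_fruit)
print(per_floor)
```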

#### Summing by column in DataFrame

In [None]:
# Let's count all the fruits on each floor

# First, create a list of column names which include 'F' 
columns = [column for column in list(fruit_list2) if 'F' in column]

# Next, sum within each column for the column names which include 'F'
fruit_count_floor_total = fruit_list2[columns].sum()
fruit_count_floor_total

#### Average 

In [None]:
# Let's find out the average count of each fruit across all floors

fruit_list2['fruit_count_avg'] = fruit_list2.loc[:, 'fruit_count_2F':'fruit_count_9F'].mean(axis=1).round(0)
fruit_list2.sort_values('fruit_count_avg', ascending=False)

#### Sorting rows in DataFrame

In [None]:
# Great! Now let's sort the fruit count total in descending order
fruit_list2.sort_values('fruit_count_total', ascending=False)

In [None]:
# sort and display in descending order, only fruits with count exceeding 500 

fruit_list2[fruit_list2['fruit_count_total']>500].sort_values('fruit_count_total', ascending=False)

#### Retrieving certain records in DataFrame

In [None]:
# Let's retrieve rows containing only yellow fruits
fruit_list2.loc[fruit_list2['fruit_color'] == "yellow"]

In [None]:
# Get unique fruit colors
print(fruit_list2['fruit_color'].unique())

#### Grouping and summing in DataFrames

In [None]:
# Group fruit count by fruit color
fruit_list2.groupby('fruit_color').sum(numeric_only=True)  # numeric_only=True leaves out the text column 'fruit'

#### Adding and deleting columns in DataFrames

In [None]:
# Add a new column 'in_stock' based on function applied to column 'fruit_count_total'
# 'in_stock' = True only if fruit_count_total > 0

fruit_list2['in_stock'] = fruit_list2['fruit_count_total'].apply(lambda x: True if x > 0 else False)
fruit_list2

In [None]:
# Let's delete 'in_stock' column
del fruit_list2['in_stock']
fruit_list2

### Now it's your turn to use Pandas to explore Biggest Loser data 

In [None]:
# Ready? Let's read the Biggest Loser csv data into a dataframe 

df = pd.read_csv('Biggest Loser 2018.csv')
df

#### Retrieve top records in DataFrame

In [None]:
# Too many rows! How can we read the top 5?
df.head(5)

#### Retrieve bottom records in DataFrame

In [None]:
# That's better. How about the bottom 3 rows?
df.tail(3)

### Now you have everything you need to code the following by yourselves!

In [None]:
# Your turn now! Let's print unique team names



In [None]:
# Great! Now retrieve data for only team_no 1



In [None]:
# Now add a column called member_tot_steps to store each person's total count across the challenge
# sort rows by total steps in descending order; display top 3




In [None]:
# sort and display in descending order, individuals who have exceeded 350K steps




In [None]:
# add a column called member_avg_steps to store average steps for each member across the challenge
# sort individuals by average steps in descending order; display top 3



In [None]:
# sum total daily steps for each team into a new DataFrame called team_df




In [None]:
# In DataFrame team_df, remove a column e.g. team captain




In [None]:
# In DataFrame team_df, add a column called team_tot_steps 
# This will store total steps for each team for entire challenge
# Sort teams by total steps in descending order; display top 3



In [None]:
# In DataFrame team_df, sort & display in descending order, teams who have exceeded 1 million steps


