# Python for scientific computing
This chapter covers scientific computing in Python. We'll start with the difference between NumPy arrays and lists and explain why arrays are better suited to complex calculations. Next, we'll cover some useful techniques for manipulating pandas DataFrames. Finally, we'll do some data visualization using scatterplots, histograms, and boxplots.

# 1. What is the difference between a NumPy array and a list?
## 1.1 Incorrect array initialization
If you pass the list `[1, (2, 3), 4]` to `np.array()`, what is the data type of the stored values?

#### Possible Answers:
1. `int64`
2. `float64`
3. `<U1`
4. `object`

In [1]:
import numpy as np

np_list = np.array([1, (2, 3), 4])
np_list.dtype

dtype('O')

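__Answer:__ The list mixes integers with a tuple, so NumPy cannot find a common numeric type and stores every element as a generic Python object: `object` (4). Note that NumPy 1.24 and later raise a `ValueError` for ragged input like this unless `dtype=object` is passed explicitly.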
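
As a minimal sketch of how NumPy infers a dtype, the following compares a few small lists. The variable names are illustrative and not part of the exercise, and `dtype=object` is passed explicitly for the ragged list so that the example also runs on recent NumPy releases.

```python
import numpy as np

# Homogeneous integers: NumPy picks an integer dtype (the width is platform dependent)
ints = np.array([1, 2, 3])

# Integers mixed with a float are upcast to a floating-point dtype
mixed_numbers = np.array([1, 2.5, 3])

# A ragged list (scalars next to a tuple) has no common shape or numeric type,
# so the elements are stored as generic Python objects
ragged = np.array([1, (2, 3), 4], dtype=object)

print(ints.dtype.kind, mixed_numbers.dtype, ragged.dtype)
```

Object arrays lose most of NumPy's speed advantages because each element is an ordinary Python object, so they are usually a sign that the input should be cleaned first.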

## 1.2 Accessing subarrays
Let's access elements in NumPy arrays! Your task is to convert a square two-dimensional array `square` with `size` rows and columns into a list by following a spiral pattern:

<img src="_datasets/spiral.png"></img>

Traversing the matrix in a spiral

Rather than simply accessing certain slices, you will define a more general solution using a `for` loop (the solution should work for any square two-dimensional array of odd size).

You will need the `reversed()` function, which returns an iterator over a sequence in reverse order.

#### Instructions:
* Convert each part marked by a red arrow to a list.
* Convert each part marked by a green arrow to a list.
* Convert each part marked by a blue arrow to a list.
* Convert each part marked by a magenta arrow to a list.

In [3]:
square = np.array([[1, 2, 3, 4, 5],
                   [6, 7, 8, 9, 10],
                   [11, 12, 13, 14, 15],
                   [16, 17, 18, 19, 20],
                   [21, 22, 23, 24, 25]])

size = square.shape[0]

print(square)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]]


In [4]:
spiral = []

for i in range(0, size):
    # Convert each part marked by a red arrow to a list
    spiral += list(square[i, i:size-i])
    # Convert each part marked by a green arrow to a list
    spiral += list(square[i+1:size-i, size-i-1])
    # Convert each part marked by a blue arrow to a list
    spiral += list(reversed(square[size-i-1, i:size-i-1]))
    # Convert each part marked by a magenta arrow to a list
    spiral += list(reversed(square[i+1:size-i-1, i]))
        
print(spiral)

[1, 2, 3, 4, 5, 10, 15, 20, 25, 24, 23, 22, 21, 16, 11, 6, 7, 8, 9, 14, 19, 18, 17, 12, 13]
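
The loop above can also be wrapped in a reusable function. The sketch below is one possible variant, not the exercise's reference solution: the name `spiral_order` is hypothetical, it iterates only over the `size // 2 + 1` rings that actually exist, and it uses slicing with a negative step (`[::-1]`) in place of `reversed()`.

```python
import numpy as np

def spiral_order(square):
    """Return the elements of a square 2-D array of odd size in spiral order."""
    size = square.shape[0]
    spiral = []
    # Only size // 2 + 1 rings exist; the last one is the single centre element
    for i in range(size // 2 + 1):
        spiral += square[i, i:size-i].tolist()                 # top row, left to right
        spiral += square[i+1:size-i, size-i-1].tolist()        # right column, top to bottom
        spiral += square[size-i-1, i:size-i-1][::-1].tolist()  # bottom row, right to left
        spiral += square[i+1:size-i-1, i][::-1].tolist()       # left column, bottom to top
    return spiral

small = np.arange(1, 10).reshape(3, 3)
print(spiral_order(small))  # [1, 2, 3, 6, 9, 8, 7, 4, 5]
```

Calling `.tolist()` converts the NumPy scalars to plain Python integers, which keeps the printed list free of NumPy-specific type wrappers.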


## 1.3 Operations with NumPy arrays
The following blocks of code create new lists from the input lists `input_list1`, `input_list2`, and `input_list3` (you can check their values in the console). Suppose you had analogous NumPy arrays with the same values: `input_array1`, `input_array2`, and `input_array3`. How would you produce the same output as NumPy arrays, using what you know about broadcasting, accessing elements in NumPy arrays, and element-wise operations?

#### Instructions:
* Substitute the code in block 1 given `input_array1`.
* Substitute the code in block 2 given `input_array2`.
* Substitute the code in block 3 given `input_array3`.

In [14]:
input_list1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
input_array1 = np.array(input_list1)

# Block 1
list(map(lambda x: [5*i for i in x], input_list1))

[[5, 10, 15], [20, 25, 30], [35, 40, 45]]

In [15]:
# Substitute the code in the block 1 given the input_array1
output_array1 = input_array1 * 5
output_array1

array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])

In [16]:
input_list2 = list(range(0, 10))
input_array2 = np.array(input_list2)

# Block 2
list(filter(lambda x: x % 2 == 0, input_list2))

[0, 2, 4, 6, 8]

In [17]:
# Substitute the code in the block 2 given the input_array2
output_array2 = input_array2[input_array2 % 2 == 0]
output_array2

array([0, 2, 4, 6, 8])

In [18]:
input_list3 = [[1, 2], [3, 4], [5, 6]]
input_array3 = np.array(input_list3)

# Block 3
[[i*i for i in j] for j in input_list3]

[[1, 4], [9, 16], [25, 36]]

In [19]:
# Substitute the code in the block 3 given the input_array3
output_array3 = input_array3 * input_array3
output_array3

array([[ 1,  4],
       [ 9, 16],
       [25, 36]])
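
The blocks above multiply an array by a scalar or by an array of the same shape. As a small additional sketch of broadcasting between arrays of different shapes, the example below combines a 3×3 array with a one-dimensional array and with a column array. The names `column_weights` and `row_offsets` and their values are made up for illustration.

```python
import numpy as np

input_array1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# A 1-D array of shape (3,) is broadcast across every row of the (3, 3) array
column_weights = np.array([1, 10, 100])  # hypothetical weights
scaled_columns = input_array1 * column_weights

# A (3, 1) array is broadcast across every column instead
row_offsets = np.array([[0], [10], [20]])  # hypothetical offsets
shifted_rows = input_array1 + row_offsets

print(scaled_columns)
print(shifted_rows)
```

In both cases NumPy stretches the smaller operand along the dimension where its length is 1 (or missing), so no explicit loop or copy is needed.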

# 2. How to use the .apply() method on a DataFrame?
## 2.1 Simple use of .apply()
Let's get some hands-on experience with `.apply()`!

You are given the full `scores` dataset containing students' performance as well as their background information.

Your task is to define the `prevalence()` function and apply it to the `groups_to_consider` columns of the `scores` DataFrame. This function should retrieve the most prevalent group/category for a given column (e.g. if the most prevalent category in the `lunch` column is `standard`, then `prevalence()` should return `standard`).

The `reduce()` function from the `functools` module is already imported.

Tip: a `pd.Series` is iterable, so you can use standard operations such as `list()` and `set()` on it.

#### Instructions:
* Create a list of tuples with the unique items from the passed object `series` and their counts.
* Extract the tuple with the highest count using `reduce()`.
* Return the item with the highest count.
* Apply the `prevalence()` function to the `scores` DataFrame using the columns specified in `groups_to_consider`.

In [21]:
import pandas as pd
from functools import reduce

scores = pd.read_csv('_datasets/exams.csv')
groups_to_consider = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
scores.head()

| Unnamed: 0 | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group E | associate's degree | free/reduced | none | 74 | 86 | 82 |
| 1 | female | group D | some college | free/reduced | none | 44 | 49 | 53 |
| 2 | male | group D | some high school | free/reduced | none | 54 | 46 | 43 |
| 3 | female | group B | bachelor's degree | standard | none | 88 | 95 | 92 |
| 4 | male | group C | master's degree | standard | completed | 85 | 81 | 81 |


In [22]:
def prevalence(series):
    vals = list(series)
    # Create a tuple list with unique items and their counts
    itms = [(x, vals.count(x)) for x in set(series)]
    # Extract a tuple with the highest counts using reduce()
    res = reduce(lambda x, y: x if x[1] > y[1] else y, itms)
    # Return the item with the highest counts
    return res[0]

# Apply the prevalence function on the scores DataFrame
result = scores[groups_to_consider].apply(prevalence)
result

gender                               female
race/ethnicity                      group C
parental level of education    some college
lunch                              standard
test preparation course                none
dtype: object

Alternatively, we can use the built-in `mode()` method instead of defining `prevalence()` and passing it to `.apply()`: `scores[groups_to_consider].mode()`. However, it's always good to practice several approaches.
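
To make the comparison concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and only mimics two columns of `scores`). It shows `mode()` next to another common idiom, `value_counts().idxmax()`.

```python
import pandas as pd

# Hypothetical toy data; the column names only mimic the scores dataset
toy = pd.DataFrame({
    'lunch': ['standard', 'standard', 'free/reduced', 'standard'],
    'gender': ['female', 'male', 'female', 'female'],
})

# mode() returns a DataFrame because a column can have several modes;
# the first row holds one most frequent value per column
by_mode = toy.mode().iloc[0]

# value_counts() counts each category, and idxmax() returns the most frequent one
by_counts = toy.apply(lambda column: column.value_counts().idxmax())

print(by_mode)
print(by_counts)
```

When a column has ties, the approaches may not pick the same category, so it is worth checking for ties before relying on a single "most prevalent" value.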

## 2.2 Additional arguments
Let's use additional arguments in the `.apply()` method!

Your task is to create two new columns in `scores`:
* `mean` is the row-wise mean value of the math score, reading score and writing score
* `rank` defines how high the mean score is:
    * `'high'` if the mean value $>90$
    * `'medium'` if the mean value $>60$ but $\le 90$
    * `'low'` if the mean value $\le 60$

To accomplish this task, you'll need to define the function `rank()` that, given a series, returns a list with two values: the mean of the series and a string defined by the rule above.

#### Instructions:
* Calculate the mean of the input series.
* Return the mean and its rank as a list.
* Insert the output of `rank()` into new columns of `scores`.

In [23]:
def rank(series):
    # Calculate the mean of the input series
    mean = np.mean(series)
    # Return the mean and its rank as a list
    if mean > 90:
        return [mean, 'high']
    if 60 < mean <= 90:
        return [mean, 'medium']
    return [mean, 'low']

# Insert the output of rank() into new columns of scores
cols = ['math score', 'reading score', 'writing score']
scores[['mean', 'rank']] = scores[cols].apply(rank, axis=1, result_type='expand')
print(scores[['mean', 'rank']].head())

        mean    rank
0  80.666667  medium
1  48.666667     low
2  47.666667     low
3  91.666667    high
4  82.333333  medium
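
For larger datasets, a row-wise `.apply()` can be slow. As a sketch of a vectorized alternative (swapping in `pd.cut`, which the exercise itself does not use), the same thresholds can be expressed as bins. The `toy` DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores; the column names mirror the scores dataset
toy = pd.DataFrame({
    'math score': [74, 44, 88, 60, 90],
    'reading score': [86, 49, 95, 60, 90],
    'writing score': [82, 53, 92, 60, 90],
})

# Row-wise mean of the three score columns
toy['mean'] = toy.mean(axis=1)

# Bins are right-closed by default: (-inf, 60], (60, 90], (90, inf]
toy['rank'] = pd.cut(toy['mean'],
                     bins=[-np.inf, 60, 90, np.inf],
                     labels=['low', 'medium', 'high'])

print(toy[['mean', 'rank']])
```

Because the bins are closed on the right, a mean of exactly 60 is ranked `'low'` and a mean of exactly 90 is ranked `'medium'`, matching the rule stated in the task.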


## 2.3 Functions with additional arguments
Let's add some arguments to the function definition!

Numeric data in `scores` represent students' performance scaled between 0 and 100. Your task is to rescale this data to an arbitrary range between `low` and `high`. Rescaling should be done in a linear fashion, i.e. for any data point x in a column:

$x_{new} = x\frac{high - low}{100}+ low$

To rescale the data, you'll have to define the function `rescale()`. Remember that the expression above can be applied to a Series directly. After defining the function, you'll have to apply it to the specified columns of `scores`.

#### Instructions:
* Define the expression to rescale the input series.
* Rescale the data in `cols` to lie between 1 and 10.

In [24]:
def rescale(series, low, high):
    # Define the expression to rescale input series
    return series * (high - low) / 100 + low

# Rescale the data in cols to lie between 1 and 10
cols = ['math score', 'reading score', 'writing score']
scores[cols] = scores[cols].apply(rescale, args=(1, 10))
print(scores[cols].head())
print(scores[cols].head())

   math score  reading score  writing score
0        7.66           8.74           8.38
1        4.96           5.41           5.77
2        5.86           5.14           4.87
3        8.92           9.55           9.28
4        8.65           8.29           8.29
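
As a quick sanity check of the rescaling formula, the sketch below applies `rescale()` to a made-up column containing both endpoints of the original scale, so 0 should map to `low` and 100 to `high`. It also passes the extra arguments by keyword, which `.apply()` forwards to the function just like `args`.

```python
import pandas as pd

def rescale(series, low, high):
    # Map values from the [0, 100] range linearly onto [low, high]
    return series * (high - low) / 100 + low

# Hypothetical toy column including both endpoints of the original scale
toy = pd.DataFrame({'math score': [0, 74, 100]})

# Keyword arguments given to .apply() are forwarded to the function
rescaled = toy.apply(rescale, low=1, high=10)

print(rescaled)
```

Keyword arguments tend to read more clearly than positional `args` when a function takes several parameters of the same type, since the call site states which value is `low` and which is `high`.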
