# Introduction to NumPy


## Understanding Data Types in Python



```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```


```python
# Python code
result = 0
for i in range(100):
    result += i
```


In [1]:
x = 4 

In [2]:
x ='four'

### A Python Integer Is More Than Just an Integer



```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```



<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Integer Memory Layout">
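
The overhead of the integer object can be observed roughly from Python itself. The sketch below assumes CPython; the exact byte count depends on the interpreter version and platform, so no specific number is claimed.

```python
import sys

# Size in bytes of a small Python integer object. This is CPython-specific:
# it includes the reference count, type pointer and size field, not just
# the digits that hold the value.
py_int_size = sys.getsizeof(4)

# A raw C integer holding the same value would typically need 4 or 8 bytes.
print(py_int_size)
```
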

### A Python List Is More Than Just a List



In [3]:
L = list(range(10))

In [4]:
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [5]:
type(L[0])

int

In [6]:
L2 = [str(c) for c in L]

In [7]:
L2

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [8]:
type(L2[0])

str

In [9]:
L3 = [True, "2", 3.0, 4]

In [10]:
[type(item) for item in L3]

[bool, str, float, int]


<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Array Memory Layout">

### Fixed-Type Arrays in Python



In [11]:
import array

In [12]:
L = list(range(10))
A = array.array('i', L)

In [13]:
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

## How Vectorization Makes Code Faster


<p><img alt="Translating Python code to bytecode" src="https://s3.amazonaws.com/dq-content/289/bytecode.svg"></p>


<table>
<thead>
<tr>
<th>Language Type</th>
<th>Example</th>
<th>Time taken to write program</th>
<th>Control over program performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>High-Level</td>
<td>Python</td>
<td>Low</td>
<td>Low</td>
</tr>
<tr>
<td>Low-Level</td>
<td>C</td>
<td>High</td>
<td>High</td>
</tr>
</tbody>
</table>



<p><img alt="For loop to sum rows" src="https://s3.amazonaws.com/dq-content/289/for_loop.svg"></p>

In [14]:
my_numbers = [[6,5], [1,3], [5,6]]

sums = []

for row in my_numbers:
    row_sum = row[0] + row[1]
    sums.append(row_sum)
    
print(sums)    

[11, 4, 11]




<p><img alt="Unvectorized operation" src="https://s3.amazonaws.com/dq-content/289/unvectorized.svg"></p>


<p><img alt="Vectorized operation" src="https://s3.amazonaws.com/dq-content/289/vectorized.svg"></p>



## NumPy library

[Documentation](http://www.numpy.org/)

In [15]:
import numpy as np

## Introduction to Ndarrays

<img alt="Dimensional Arrays" src="./images/one_dim.svg">

In [16]:
data_ndarray = np.array([5,10,15,20])

In [17]:
data_ndarray

array([ 5, 10, 15, 20])

In [18]:
type(data_ndarray)

numpy.ndarray

<img alt="Dimensional Arrays" src="./images/Two_Dim.svg">

In [19]:
data2 = [[1,2,3,4], [5,6,7,8]]
arr2 = np.array(data2)

In [20]:
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

## Preparing the Data

<div>

<ul>
<li><code>pickup_year</code>: The year of the trip.</li>
<li><code>pickup_month</code>: The month of the trip (January is <code>1</code>, December is <code>12</code>).</li>
<li><code>pickup_day</code>: The day of the month of the trip.</li>
<li><code>pickup_location_code</code>: The airport or <a target="_blank" href="https://en.wikipedia.org/wiki/Boroughs_of_New_York_City">borough</a> where the trip started.</li>
<li><code>dropoff_location_code</code>: The airport or borough where the trip finished.</li>
<li><code>trip_distance</code>: The distance of the trip in miles.</li>
<li><code>trip_length</code>: The length of the trip in seconds.</li>
<li><code>fare_amount</code>: The base fare of the trip, in dollars.</li>
<li><code>total_amount</code>: The total amount charged to the passenger, including all fees, tolls and tips.</li>
</ul>

</div>

    pickup_year,pickup_month,pickup_day,pickup_dayofweek
    2016,1,1,5
    2016,1,1,5
    2016,1,1,5
    2016,1,1,5

In [21]:
!head -n 7 data/nyc_taxis.csv

pickup_year,pickup_month,pickup_day,pickup_dayofweek,pickup_time,pickup_location_code,dropoff_location_code,trip_distance,trip_length,fare_amount,fees_amount,tolls_amount,tip_amount,total_amount,payment_type
2016,1,1,5,0,2,4,21.00,2037,52.00,0.80,5.54,11.65,69.99,1
2016,1,1,5,0,2,1,16.29,1520,45.00,1.30,0.00,8.00,54.30,1
2016,1,1,5,0,2,6,12.70,1462,36.50,1.30,0.00,0.00,37.80,2
2016,1,1,5,0,2,6,8.70,1210,26.00,1.30,0.00,5.46,32.76,1
2016,1,1,5,0,2,6,5.56,759,17.50,1.30,0.00,0.00,18.80,2
2016,1,1,5,0,4,2,21.45,2004,52.00,0.80,0.00,52.80,105.60,1


    # our list of lists is stored as data_list
    data_ndarray = np.array(data_list)

In [22]:
import csv
import numpy as np

with open('data/nyc_taxis.csv', 'r') as f:
    taxi_list = list(csv.reader(f))


In [23]:
print(taxi_list[:3])

[['pickup_year', 'pickup_month', 'pickup_day', 'pickup_dayofweek', 'pickup_time', 'pickup_location_code', 'dropoff_location_code', 'trip_distance', 'trip_length', 'fare_amount', 'fees_amount', 'tolls_amount', 'tip_amount', 'total_amount', 'payment_type'], ['2016', '1', '1', '5', '0', '2', '4', '21.00', '2037', '52.00', '0.80', '5.54', '11.65', '69.99', '1'], ['2016', '1', '1', '5', '0', '2', '1', '16.29', '1520', '45.00', '1.30', '0.00', '8.00', '54.30', '1']]


In [24]:
taxi_list = taxi_list[1:]

In [25]:
converted_taxi_list = []

for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

In [26]:
print(converted_taxi_list[:3])

[[2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 4.0, 21.0, 2037.0, 52.0, 0.8, 5.54, 11.65, 69.99, 1.0], [2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 1.0, 16.29, 1520.0, 45.0, 1.3, 0.0, 8.0, 54.3, 1.0], [2016.0, 1.0, 1.0, 5.0, 0.0, 2.0, 6.0, 12.7, 1462.0, 36.5, 1.3, 0.0, 0.0, 37.8, 2.0]]


In [27]:
taxi = np.array(converted_taxi_list)

## Array Shapes

In [28]:
print(taxi)

[[2.016e+03 1.000e+00 1.000e+00 ... 1.165e+01 6.999e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 8.000e+00 5.430e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 0.000e+00 3.780e+01 2.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 5.000e+00 6.334e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 8.950e+00 4.475e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 0.000e+00 5.484e+01 2.000e+00]]


In [29]:
type(taxi)

numpy.ndarray

In [30]:
data2 = [[1,2,3,4], [5,6,7,8]]
data2 = np.array(data2)
data2.shape

(2, 4)

<div class="alert alert-block alert-info">
<b>Exercise:</b> Assign the shape of taxi to taxi_shape. Print the result.</div>

In [31]:
taxi.shape

(89560, 15)

In [32]:
taxi.size

1343400

In [33]:
taxi.ndim

2

In [34]:
taxi.itemsize

8

In [35]:
taxi.nbytes

10747200

In [36]:
taxi.nbytes/1024/1024

10.24932861328125
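
The attributes above are related: `nbytes` is `size` multiplied by `itemsize`. A minimal sketch on a small stand-in array (the name `demo` is hypothetical, not part of the taxi data):

```python
import numpy as np

# Hypothetical stand-in for the taxi array: 4 rows x 3 columns of float64.
demo = np.zeros((4, 3), dtype=np.float64)

# 12 elements * 8 bytes per float64 element
total_bytes = demo.size * demo.itemsize
print(total_bytes, demo.nbytes)

# The same conversion to mebibytes used for the taxi array above.
size_mib = demo.nbytes / 1024 / 1024
```
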

## Selecting and Slicing Rows and Items from ndarrays

<img alt="Dimensional Arrays" src="./images/selection_rows.svg">

    ndarray[row_index,column_index]

    # or if you want to select all
    # columns for a given set of rows
    ndarray[row_index]

<img alt="Dimensional Arrays" src="./images/selection_item.svg">

In [40]:
test = np.random.randint(0, 10, (5,5))

In [41]:
test

array([[9, 3, 5, 2, 4],
       [7, 6, 8, 8, 1],
       [6, 7, 7, 8, 1],
       [5, 9, 8, 9, 4],
       [3, 0, 3, 5, 0]])

In [42]:
test[0]

array([9, 3, 5, 2, 4])

In [43]:
test[-1]

array([3, 0, 3, 5, 0])

In [44]:
test[1:3]

array([[7, 6, 8, 8, 1],
       [6, 7, 7, 8, 1]])

In [45]:
row2_3 = test[1:3]

In [46]:
row2_3

array([[7, 6, 8, 8, 1],
       [6, 7, 7, 8, 1]])

In [47]:
test[2:]

array([[6, 7, 7, 8, 1],
       [5, 9, 8, 9, 4],
       [3, 0, 3, 5, 0]])

In [49]:
test[2,3]

8

<div class="alert alert-block alert-info">
<b>Exercise:</b> From the taxi ndarray:
- Select the row at index 0. Assign it to row_0.
- Select every column for the rows at indexes 391 to 500 inclusive. Assign them to rows_391_to_500.
- Select the item at row index 21 and column index 5. Assign it to row_21_column_5.</div>

In [52]:
row_0 = taxi[0]

In [53]:
row_0

array([2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
       4.000e+00, 2.100e+01, 2.037e+03, 5.200e+01, 8.000e-01, 5.540e+00,
       1.165e+01, 6.999e+01, 1.000e+00])

In [54]:
rows_391_to_500 = taxi[391:501]

In [56]:
#rows_391_to_500

In [57]:
row_21_column_5 = taxi[21, 5]

## Selecting Columns and Custom Slicing ndarrays

<img alt="Dimensional Arrays" src="./images/selection_columns_updated.svg">

<img alt="Dimensional Arrays" src="./images/selection_1darray_updated.svg">

<img alt="Dimensional Arrays" src="./images/selection_2darray_updated.svg">

In [58]:
column_test = np.random.randint(0, 10, (5,5))

In [60]:
column_test

array([[2, 3, 8, 1, 3],
       [3, 3, 7, 0, 1],
       [9, 9, 0, 4, 7],
       [3, 2, 7, 2, 0],
       [0, 4, 5, 5, 6]])

In [61]:
column_test[:, 3]

array([1, 0, 4, 2, 5])

In [62]:
column_test[:, 1:3]

array([[3, 8],
       [3, 7],
       [9, 0],
       [2, 7],
       [4, 5]])

In [63]:
column_test[:, [1, 3, 4]]

array([[3, 1, 3],
       [3, 0, 1],
       [9, 4, 7],
       [2, 2, 0],
       [4, 5, 6]])

In [64]:
column_test[2, 1:4]

array([9, 0, 4])

In [65]:
column_test[1:, 4]

array([1, 7, 0, 6])

In [66]:
column_test[1:4, :3]

array([[3, 3, 7],
       [9, 9, 0],
       [3, 2, 7]])

<div class="alert alert-block alert-info">
<b>Exercise:</b> From the taxi ndarray:
- Select every row for the columns at indexes 1, 4, and 7. Assign them to columns_1_4_7.
- Select the columns at indexes 5 to 8 inclusive for the row at index 99. Assign them to row_99_columns_5_to_8.
- Select the rows at indexes 100 to 200 inclusive for the column at index 14. Assign them to rows_100_to_200_column_14.</div>

In [67]:
columns_1_4_7 = taxi[:, [1,4,7]]

In [68]:
row_99_columns_5_to_8 = taxi[99, 5:9]

In [69]:
rows_100_to_200_column_14 = taxi[100:201, 14]

## Vector Math

In [70]:
my_numbers = [[6,5], [9,1], [2,4], [7,14], [8,6]]

In [71]:
sums = []
for row in my_numbers:
    row_sums = row[0] + row[1]
    sums.append(row_sums)
    
print(sums)

[11, 10, 6, 21, 14]


In [72]:
my_numbers = np.array(my_numbers)

In [73]:
my_numbers

array([[ 6,  5],
       [ 9,  1],
       [ 2,  4],
       [ 7, 14],
       [ 8,  6]])

In [75]:
sums = my_numbers[:,0] + my_numbers[:,1]

In [76]:
sums

array([11, 10,  6, 21, 14])

<div class="alert alert-block alert-info">
<b>Exercise:</b> 
Use vector addition to add fare_amount and fees_amount. Assign the result to fare_and_fees.</div>

In [77]:
# taxi column index 9
fare_amount = taxi[:,9]

In [78]:
# taxi column index 10
fees_amount = taxi[:,10]

In [79]:
fare_and_fees = fare_amount + fees_amount

In [80]:
fare_and_fees

array([52.8, 46.3, 37.8, ..., 52.8, 35.8, 49.3])

    miles_per_hour = distance_in_miles / length_in_hours

In [82]:
trip_dstance = taxi[:, 7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600

In [83]:
trip_mph = trip_dstance / trip_length_hours

In [84]:
trip_mph

array([37.11340206, 38.58157895, 31.27222982, ..., 22.29907867,
       42.41551247, 36.90473407])



<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following table lists the arithmetic operators implemented in NumPy:</p>
<table>
<thead><tr>
<th>Operator</th>
<th>Equivalent ufunc</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>+</code></td>
<td><code>np.add</code></td>
<td>Addition (e.g., <code>1 + 1 = 2</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.subtract</code></td>
<td>Subtraction (e.g., <code>3 - 2 = 1</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.negative</code></td>
<td>Unary negation (e.g., <code>-2</code>)</td>
</tr>
<tr>
<td><code>*</code></td>
<td><code>np.multiply</code></td>
<td>Multiplication (e.g., <code>2 * 3 = 6</code>)</td>
</tr>
<tr>
<td><code>/</code></td>
<td><code>np.divide</code></td>
<td>Division (e.g., <code>3 / 2 = 1.5</code>)</td>
</tr>
<tr>
<td><code>//</code></td>
<td><code>np.floor_divide</code></td>
<td>Floor division (e.g., <code>3 // 2 = 1</code>)</td>
</tr>
<tr>
<td><code>**</code></td>
<td><code>np.power</code></td>
<td>Exponentiation (e.g., <code>2 ** 3 = 8</code>)</td>
</tr>
<tr>
<td><code>%</code></td>
<td><code>np.mod</code></td>
<td>Modulus/remainder (e.g., <code>9 % 4 = 1</code>)</td>
</tr>
</tbody>
</table>

</div>
</div>
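
As a quick check of the table, each operator can be compared with its ufunc on small arrays. The arrays `x` and `y` below are made-up values for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 2, 2, 2])

# Each operator dispatches to the ufunc listed in the table,
# so the results are element-for-element identical.
same_add = np.array_equal(x + y, np.add(x, y))
same_floor = np.array_equal(x // y, np.floor_divide(x, y))
same_pow = np.array_equal(x ** y, np.power(x, y))
same_mod = np.array_equal(x % y, np.mod(x, y))

print(same_add, same_floor, same_pow, same_mod)
```
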


In [85]:
trip_mph = np.divide(trip_dstance, trip_length_hours)

In [86]:
trip_mph

array([37.11340206, 38.58157895, 31.27222982, ..., 22.29907867,
       42.41551247, 36.90473407])

## Calculating Statistics For 1D ndarrays

In [87]:
trip_mph.min()

0.0

<ul>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min"><code>ndarray.min()</code> to calculate the minimum value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.max.html"><code>ndarray.max()</code> to calculate the maximum value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean"><code>ndarray.mean()</code> to calculate the mean or average value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.sum.html#numpy.ndarray.sum"><code>ndarray.sum()</code> to calculate the sum of the values</a></li>
</ul>

In [88]:
np.min(trip_mph)

0.0


<p></p><center><img alt="Method syntax" src="https://s3.amazonaws.com/dq-content/289/Method_syntax.svg"></center><p></p>


<div class="alert alert-block alert-info">
<b>Exercise:</b> Use the ndarray.max() method to calculate the maximum value of trip_mph. Assign the result to mph_max.
Use the ndarray.mean() method to calculate the average value of trip_mph. Assign the result to mph_mean.</div>

In [89]:
trip_mph.max()

82800.0

In [90]:
trip_mph.mean()

32.24258580925573

<div>

<table>
<thead>
<tr>
<th>Calculation</th>
<th>Function Representation</th>
<th>Method Representation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Calculate the minimum value of <code>trip_mph</code></td>
<td><code>np.min(trip_mph)</code></td>
<td><code>trip_mph.min()</code></td>
</tr>
<tr>
<td>Calculate the maximum value of <code>trip_mph</code></td>
<td><code>np.max(trip_mph)</code></td>
<td><code>trip_mph.max()</code></td>
</tr>
<tr>
<td>Calculate the <a target="_blank" href="https://en.wikipedia.org/wiki/Mean">mean average</a> value of <code>trip_mph</code></td>
<td><code>np.mean(trip_mph)</code></td>
<td><code>trip_mph.mean()</code></td>
</tr>
<tr>
<td>Calculate the <a target="_blank" href="https://en.wikipedia.org/wiki/Median">median average</a> value of <code>trip_mph</code></td>
<td><code>np.median(trip_mph)</code></td>
<td>There is no ndarray median method</td>
</tr>
</tbody>
</table>
</div>
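
The table can be tried on a small array. The values in `speeds` are hypothetical; the sketch shows that the function and method forms agree for the mean, while the median is available only as a function.

```python
import numpy as np

speeds = np.array([10.0, 20.0, 30.0, 1000.0])  # hypothetical values

mean_func = np.mean(speeds)      # function form
mean_method = speeds.mean()      # method form, same result

# The median is only available as a function.
median_func = np.median(speeds)
has_median_method = hasattr(speeds, "median")

print(mean_func, mean_method, median_func, has_median_method)
```

With one extreme value in the data, the median stays close to the typical values while the mean is pulled upwards, which is relevant for a column like `trip_mph` that contains outliers.
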

## Calculating Statistics For 2D ndarrays

<img alt="Dimensional Arrays" src="./images/array_method_axis_none.svg">

<img alt="Dimensional Arrays" src="./images/array_method_axis_1.svg">

<img alt="Dimensional Arrays" src="./images/array_method_axis_0.svg">



<p><img alt="The axis parameter" src="https://s3.amazonaws.com/dq-content/289/axis_param.svg"></p>


In [91]:
np.ones((4,1))

array([[1.],
       [1.],
       [1.],
       [1.]])

In [92]:
np.ones((1,4))

array([[1., 1., 1., 1.]])

In [94]:
test = np.random.randint(0,9, (5,5))

In [95]:
test

array([[3, 7, 5, 5, 0],
       [1, 5, 3, 0, 5],
       [0, 1, 2, 4, 2],
       [0, 3, 2, 0, 7],
       [5, 0, 2, 7, 2]])

In [96]:
test.max()

7

In [97]:
test.max(axis=0)

array([5, 7, 5, 7, 7])

In [98]:
test.max(axis=1)

array([7, 5, 4, 7, 7])

In [99]:
taxi_first_five = taxi[:5]

In [100]:
fare_components = taxi_first_five[:, 9:13]

In [101]:
fare_components

array([[52.  ,  0.8 ,  5.54, 11.65],
       [45.  ,  1.3 ,  0.  ,  8.  ],
       [36.5 ,  1.3 ,  0.  ,  0.  ],
       [26.  ,  1.3 ,  0.  ,  5.46],
       [17.5 ,  1.3 ,  0.  ,  0.  ]])

In [102]:
fare_components.sum(axis=1)

array([69.99, 54.3 , 37.8 , 32.76, 18.8 ])

In [104]:
taxi_first_five[:,13]

array([69.99, 54.3 , 37.8 , 32.76, 18.8 ])
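
The two outputs above look identical, but floating-point sums are safer to compare with a tolerance. A minimal sketch using the five rows printed above, copied in by hand so that the block runs on its own:

```python
import numpy as np

# Columns 9-12 and column 13 for the first five trips, copied from the
# outputs above.
fare_components = np.array([
    [52.0, 0.8, 5.54, 11.65],
    [45.0, 1.3, 0.0, 8.0],
    [36.5, 1.3, 0.0, 0.0],
    [26.0, 1.3, 0.0, 5.46],
    [17.5, 1.3, 0.0, 0.0],
])
total_amount = np.array([69.99, 54.3, 37.8, 32.76, 18.8])

# axis=1 collapses the columns: one sum per row (per trip).
row_sums = fare_components.sum(axis=1)
matches = np.allclose(row_sums, total_amount)

# axis=0 collapses the rows: one sum per column (per fare component).
col_sums = fare_components.sum(axis=0)

print(matches, col_sums.shape)
```
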

## Reading CSV files with NumPy

<p>Below is information about selected columns from the data set:</p>
<ul>
<li><code>pickup_year</code>: The year of the trip.</li>
<li><code>pickup_month</code>: The month of the trip (January is <code>1</code>, December is <code>12</code>).</li>
<li><code>pickup_day</code>: The day of the month of the trip.</li>
<li><code>pickup_location_code</code>: The airport or <a target="_blank" href="https://en.wikipedia.org/wiki/Boroughs_of_New_York_City">borough</a> where the trip started.</li>
<li><code>dropoff_location_code</code>: The airport or borough where the trip finished.</li>
<li><code>trip_distance</code>: The distance of the trip in miles.</li>
<li><code>trip_length</code>: The length of the trip in seconds.</li>
<li><code>fare_amount</code>: The base fare of the trip, in dollars.</li>
<li><code>total_amount</code>: The total amount charged to the passenger, including all fees, tolls and tips.</li>
</ul>


In [110]:
!head data/nyc_taxis.csv

pickup_year,pickup_month,pickup_day,pickup_dayofweek,pickup_time,pickup_location_code,dropoff_location_code,trip_distance,trip_length,fare_amount,fees_amount,tolls_amount,tip_amount,total_amount,payment_type
2016,1,1,5,0,2,4,21.00,2037,52.00,0.80,5.54,11.65,69.99,1
2016,1,1,5,0,2,1,16.29,1520,45.00,1.30,0.00,8.00,54.30,1
2016,1,1,5,0,2,6,12.70,1462,36.50,1.30,0.00,0.00,37.80,2
2016,1,1,5,0,2,6,8.70,1210,26.00,1.30,0.00,5.46,32.76,1
2016,1,1,5,0,2,6,5.56,759,17.50,1.30,0.00,0.00,18.80,2
2016,1,1,5,0,4,2,21.45,2004,52.00,0.80,0.00,52.80,105.60,1
2016,1,1,5,0,2,6,8.45,927,24.50,1.30,0.00,6.45,32.25,1
2016,1,1,5,0,2,6,7.30,731,21.50,1.30,0.00,0.00,22.80,2
2016,1,1,5,0,2,5,36.30,2562,109.50,0.80,11.08,10.00,131.38,1


https://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt

In [105]:
taxi = np.genfromtxt('data/nyc_taxis.csv', delimiter=',')

In [106]:
taxi_shape = taxi.shape

In [107]:
taxi_shape

(89561, 15)

In [108]:
taxi.dtype

dtype('float64')

In [109]:
print(taxi)

[[      nan       nan       nan ...       nan       nan       nan]
 [2.016e+03 1.000e+00 1.000e+00 ... 1.165e+01 6.999e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 8.000e+00 5.430e+01 1.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 5.000e+00 6.334e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 8.950e+00 4.475e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 0.000e+00 5.484e+01 2.000e+00]]


In [112]:
taxi = taxi[1:]

In [118]:
taxi = np.genfromtxt('data/nyc_taxis.csv', delimiter=',', skip_header=1)

In [119]:
print(taxi)

[[2.016e+03 1.000e+00 1.000e+00 ... 1.165e+01 6.999e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 8.000e+00 5.430e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 0.000e+00 3.780e+01 2.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 5.000e+00 6.334e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 8.950e+00 4.475e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 0.000e+00 5.484e+01 2.000e+00]]


In [115]:
taxi.shape

(89560, 15)

`usecols : sequence, optional` Which columns to read, with 0 being the first. For example, usecols = (1, 4, 5) will extract the 2nd, 5th and 6th columns.
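
A small sketch of `usecols` in practice. To keep it self-contained it reads from an in-memory string rather than `data/nyc_taxis.csv`; the two data rows are shortened, made-up samples in the same layout.

```python
import io
import numpy as np

# Shortened sample in the same layout as the first six columns of the file.
csv_text = """pickup_year,pickup_month,pickup_day,pickup_dayofweek,pickup_time,pickup_location_code
2016,1,1,5,0,2
2016,1,1,5,0,4
"""

# Read only the 2nd, 5th and 6th columns (indexes 1, 4 and 5),
# skipping the header row so it does not become a row of nan values.
subset = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,
    usecols=(1, 4, 5),
)

print(subset)
```
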

## Datatypes

In [120]:
x = np.array([1, 2])
print(x.dtype) 
print(x.nbytes)

int64
16


In [121]:
x = np.array([1.0, 2.0])
print(x.dtype) 
print(x.nbytes)

float64
16


In [122]:
x = np.array([1, 2], dtype=np.int32)
print(x.dtype) 
print(x.nbytes)

int32
8


In [123]:
x = np.array([1, 2], dtype=np.int8)
print(x.dtype)
print(x.nbytes)

int8
2


In [124]:
x = np.array([1, 2], dtype='int8')
print(x.dtype)
print(x.nbytes)

int8
2


<div class="text_cell_render border-box-sizing rendered_html">
<table>
<thead><tr>
<th>Data type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>bool_</code></td>
<td>Boolean (True or False) stored as a byte</td>
</tr>
<tr>
<td><code>int_</code></td>
<td>Default integer type (same as C <code>long</code>; normally either <code>int64</code> or <code>int32</code>)</td>
</tr>
<tr>
<td><code>intc</code></td>
<td>Identical to C <code>int</code> (normally <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>intp</code></td>
<td>Integer used for indexing (same as C <code>ssize_t</code>; normally either <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>int8</code></td>
<td>Byte (-128 to 127)</td>
</tr>
<tr>
<td><code>int16</code></td>
<td>Integer (-32768 to 32767)</td>
</tr>
<tr>
<td><code>int32</code></td>
<td>Integer (-2147483648 to 2147483647)</td>
</tr>
<tr>
<td><code>int64</code></td>
<td>Integer (-9223372036854775808 to 9223372036854775807)</td>
</tr>
<tr>
<td><code>uint8</code></td>
<td>Unsigned integer (0 to 255)</td>
</tr>
<tr>
<td><code>uint16</code></td>
<td>Unsigned integer (0 to 65535)</td>
</tr>
<tr>
<td><code>uint32</code></td>
<td>Unsigned integer (0 to 4294967295)</td>
</tr>
<tr>
<td><code>uint64</code></td>
<td>Unsigned integer (0 to 18446744073709551615)</td>
</tr>
<tr>
<td><code>float_</code></td>
<td>Shorthand for <code>float64</code>.</td>
</tr>
<tr>
<td><code>float16</code></td>
<td>Half precision float: sign bit, 5 bits exponent, 10 bits mantissa</td>
</tr>
<tr>
<td><code>float32</code></td>
<td>Single precision float: sign bit, 8 bits exponent, 23 bits mantissa</td>
</tr>
<tr>
<td><code>float64</code></td>
<td>Double precision float: sign bit, 11 bits exponent, 52 bits mantissa</td>
</tr>
<tr>
<td><code>complex_</code></td>
<td>Shorthand for <code>complex128</code>.</td>
</tr>
<tr>
<td><code>complex64</code></td>
<td>Complex number, represented by two 32-bit floats</td>
</tr>
<tr>
<td><code>complex128</code></td>
<td>Complex number, represented by two 64-bit floats</td>
</tr>
</tbody>
</table>

</div>

In [125]:
x = np.array([189, 22, 128, -129], dtype=np.int8)

In [126]:
x

array([ -67,   22, -128,  127], dtype=int8)
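
The output above shows values outside the `int8` range (-128 to 127) wrapping around. That output comes from an older NumPy release; newer releases may raise an `OverflowError` when out-of-range Python integers are passed together with `dtype=np.int8`, so this is worth checking against the installed version. The wraparound itself can still be reproduced with an explicit cast, as in this sketch:

```python
import numpy as np

# Build the array with the default integer dtype, where every value fits.
wide = np.array([189, 22, 128, -129])

# Casting down to int8 keeps only the low 8 bits of each value,
# so out-of-range numbers wrap around instead of being clipped.
narrow = wide.astype(np.int8)

print(narrow)
```
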

## Boolean Indexing

### Boolean Arrays

In [127]:
True, False

(True, False)

In [128]:
type(3.5) == float

True

In [129]:
5 > 6

False

In [130]:
np.array([2,4,5,9]) + 10 

array([12, 14, 15, 19])

In [131]:
np.array([2,4,5,9]) < 5 

array([ True,  True, False, False])

<div class="alert alert-block alert-info">
Use vectorized boolean operations to:
<li> Evaluate whether the elements in array a are less than 3. Assign the result to a_bool.</li> 
<li> Evaluate whether the elements in array b are equal to "blue". Assign the result to b_bool.</li> 
<li>  Evaluate whether the elements in array c are greater than 100. Assign the result to c_bool.</li> </div>

In [132]:
a = np.array([1, 2, 3, 4, 5])

In [133]:
a < 3

array([ True,  True, False, False, False])

In [134]:
b = np.array(["blue", "blue", "red", "blue"])

In [135]:
b == 'blue'

array([ True,  True, False,  True])

In [136]:
c = np.array([80.0, 103.4, 96.9, 200.3])

In [137]:
c > 100

array([False,  True, False,  True])

### Boolean Indexing with 1D ndarrays

In [138]:
c = np.array([80.0, 103.4, 96.9, 200.3])

In [143]:
c_bool = c > 100

In [144]:
c_bool

array([False,  True, False,  True])

In [145]:
result = c[c_bool]

In [146]:
result

array([103.4, 200.3])

In [147]:
month = taxi[:,1]

In [148]:
january_bool = month == 1

In [149]:
january = month[january_bool]

In [151]:
january.shape[0]

13481

<div class="alert alert-block alert-info">
Calculate the number of rides in the taxi ndarray that are from February:
<li> Create a boolean array, february_bool, that evaluates whether the items in pickup_month are equal to 2.</li> 
<li> Use the february_bool boolean array to index pickup_month. Assign the result to february.</li> 
<li> Use the ndarray.shape attribute to find the number of items in february. Assign the result to february_rides.</li> </div>

In [153]:
february_bool = month == 2
february = month[february_bool]
february_rides = february.shape[0]
february_rides

13333

In [155]:
pickup_month = taxi[:,1]
rides = []

for month in range(1,13):
    month_bool = pickup_month == month
    month_array = pickup_month[month_bool]
    month_rides = month_array.shape[0]
    rides.append(month_rides)

print(rides)

[13481, 13333, 15547, 14810, 16650, 15739, 0, 0, 0, 0, 0, 0]
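
The loop above can also be written without an explicit Python loop. The sketch below is one possible alternative, shown on a small made-up `pickup_month` array rather than the real taxi column: `np.unique` with `return_counts=True` counts every month in one call, and summing a boolean mask counts a single month.

```python
import numpy as np

# Hypothetical month column; the real one is taxi[:, 1].
pickup_month = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 6.0])

# All months that occur, with the number of rides in each.
months, counts = np.unique(pickup_month, return_counts=True)

# True counts as 1 and False as 0, so summing the mask counts the matches.
march_rides = int((pickup_month == 3).sum())

print(months, counts, march_rides)
```

Unlike the loop, `np.unique` only reports months that actually occur, so months without rides are omitted rather than listed with a count of 0.
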


### Boolean Indexing with 2D ndarrays

<img alt="Dimensional Arrays" src="./images/bool_dims_updated.svg">

In [156]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

In [158]:
trip_mph.max()

82800.0

In [159]:
trips_over_20000 = taxi[trip_mph > 20000, 5:9]

In [160]:
trips_over_20000

array([[ 2. ,  2. , 23. ,  1. ],
       [ 2. ,  2. , 19.6,  1. ],
       [ 2. ,  2. , 16.7,  2. ],
       [ 3. ,  3. , 17.8,  2. ],
       [ 2. ,  2. , 17.2,  2. ],
       [ 3. ,  3. , 16.9,  3. ],
       [ 2. ,  2. , 27.1,  4. ]])

<div class="alert alert-block alert-info">
<b>Exercise: </b>Create a boolean array, tip_bool, that determines which rows have values for the tip_amount column of more than 50. Use the tip_bool array to select all rows from taxi with tip amounts of more than 50, and the columns from indexes 5 to 13 inclusive. Assign the resulting array to top_tips. </div>

In [161]:
tip_amount = taxi[:, 12]

In [162]:
top_tips = taxi[tip_amount > 50, 5:14]

In [164]:
top_tips[:2]

array([[4.000e+00, 2.000e+00, 2.145e+01, 2.004e+03, 5.200e+01, 8.000e-01,
        0.000e+00, 5.280e+01, 1.056e+02],
       [3.000e+00, 4.000e+00, 9.200e+00, 1.041e+03, 2.700e+01, 1.300e+00,
        5.540e+00, 6.000e+01, 9.384e+01]])

## Assigning Values

### Assigning Values in ndarrays

    ndarray[location_of_values] = new_value

In [165]:
a = np.array(['red','blue','black','blue','purple'])

In [166]:
a[0] = 'orange'

In [167]:
a

array(['orange', 'blue', 'black', 'blue', 'purple'], dtype='<U6')

In [168]:
a[3:] = 'pink'

In [169]:
a

array(['orange', 'blue', 'black', 'pink', 'pink'], dtype='<U6')

In [170]:
a[3:] = ['pink', 'orange']

In [171]:
a

array(['orange', 'blue', 'black', 'pink', 'orange'], dtype='<U6')

In [172]:
ones = np.ones((3,5))

In [173]:
ones

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [174]:
ones[1, 2] = 99

In [175]:
ones

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1., 99.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [176]:
ones[:,2] = 0

In [177]:
ones

array([[1., 1., 0., 1., 1.],
       [1., 1., 0., 1., 1.],
       [1., 1., 0., 1., 1.]])

In [178]:
ones[0] = 42

In [179]:
ones

array([[42., 42., 42., 42., 42.],
       [ 1.,  1.,  0.,  1.,  1.],
       [ 1.,  1.,  0.,  1.,  1.]])

<div class="alert alert-block alert-info">
<b>Exercise: </b>To help you practice without making changes to our original array, we have used the ndarray.copy() method to make taxi_modified, a copy of our original for these exercises.
<li> The value at column index 5 (pickup_location) of row index 28214 is incorrect. Use assignment to change this value to 1 in the taxi_modified ndarray.</li> 
<li> The first column (index 0) contains year values as four digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the taxi_modified ndarray.</li> 
<li> The values at column index 7 (trip_distance) of rows index 1800 and 1801 are incorrect. Use assignment to change these values in the taxi_modified ndarray to the mean value for that column.</li> </div>

In [180]:
taxi_modified = taxi.copy()

In [181]:
taxi_modified[28214,5] = 1

In [182]:
taxi_modified[:,0] = 16

In [183]:
taxi_modified[1800:1802,7] = taxi_modified[:,7].mean()

### Assignment Using Boolean Arrays

In [190]:
a2 = np.array([1,2,3,4,5])

In [191]:
a2[((a2>2) & (a2 < 5))] = 99

In [192]:
a2

array([ 1,  2, 99, 99,  5])

- & = AND
- | = OR
- ~ = NOT
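
A short sketch of the three operators on a small made-up array. Each comparison is wrapped in parentheses because `&`, `|` and `~` bind more tightly than `<` and `>` in Python.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

both = a[(a > 2) & (a < 5)]     # AND: greater than 2 and less than 5
either = a[(a < 2) | (a > 4)]   # OR: less than 2 or greater than 4
negated = a[~(a > 2)]           # NOT: everything that is not greater than 2

print(both, either, negated)
```
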

<div class="alert alert-block alert-info">
<b>Exercise: </b>We again used the ndarray.copy() method to make taxi_copy, a copy of our original for this exercise.
<li> Select the fourteenth column (index 13) in taxi_copy. Assign it to a variable named total_amount.</li> 
<li> For rows where the value of total_amount is less than 0, use assignment to change the value to 0.</li> 
 </div>

In [193]:
taxi_copy = taxi.copy()

In [194]:
total_amount = taxi_copy[:, 13]

In [195]:
total_amount[total_amount < 0] = 0

<hr>

In [202]:
b = np.linspace(1,9, num=9, dtype=int)
b = np.reshape(b, (3,3))
c = b.copy()

In [201]:
b

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [203]:
b[b>4] = 99

In [204]:
b

array([[ 1,  2,  3],
       [ 4, 99, 99],
       [99, 99, 99]])

In [205]:
c

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [206]:
c[c[:,1] > 2, 1] = 99

In [207]:
c

array([[ 1,  2,  3],
       [ 4, 99,  6],
       [ 7, 99,  9]])

    array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
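
The pattern above with concrete values filled in. The array `grid` and the chosen columns are made up for illustration: the comparison uses column 0 and the assignment targets column 2.

```python
import numpy as np

grid = np.array([[1, 10, 0],
                 [2, 20, 0],
                 [1, 30, 0]])

# In every row where column 0 equals 1, set column 2 to 99.
# The boolean array selects the rows; the integer 2 selects the column.
grid[grid[:, 0] == 1, 2] = 99

print(grid)
```
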

## Adding Rows and Columns to ndarrays

[numpy.concatenate()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html)

In [208]:
ones = np.ones((2,3))

In [209]:
ones

array([[1., 1., 1.],
       [1., 1., 1.]])

In [210]:
zeros = np.zeros(3)

In [211]:
zeros

array([0., 0., 0.])

In [212]:
combined = np.concatenate([ones, zeros], axis=0)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

In [213]:
ones.shape

(2, 3)

In [214]:
zeros.shape

(3,)

We need: (1, 3)

In [215]:
zeros_2d = np.expand_dims(zeros, axis=0)

In [216]:
zeros_2d

array([[0., 0., 0.]])

In [217]:
zeros_2d.shape

(1, 3)

In [219]:
combined = np.concatenate([ones, zeros_2d], axis=0)

In [220]:
print(combined)

[[1. 1. 1.]
 [1. 1. 1.]
 [0. 0. 0.]]
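
Adding a column follows the same idea with `axis=1`. In this sketch the names `new_col` and `with_col` are hypothetical; the 1D array has to be expanded to shape `(2, 1)` so that it lines up with the two rows of `ones`.

```python
import numpy as np

ones = np.ones((2, 3))
new_col = np.zeros(2)                           # shape (2,)

# Turn the 1D array into a single column: shape (2, 1).
new_col_2d = np.expand_dims(new_col, axis=1)

# axis=1 joins along the columns, so the row counts must match.
with_col = np.concatenate([ones, new_col_2d], axis=1)

print(with_col)
```
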


## Computation on NumPy Arrays: Universal Functions


### The Slowness of Loops



In [223]:
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [224]:
big_array = np.random.randint(1,100, size=1000000)
%timeit compute_reciprocals(big_array)

7.78 s ± 1.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Introducing UFuncs (Universal functions)


In [226]:
print(compute_reciprocals(values))
print(1.0 / values)

[0.16666667 1.         0.25       0.25       0.125     ]
[0.16666667 1.         0.25       0.25       0.125     ]


In [227]:
%timeit (1.0 / big_array)

1.96 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Subarrays as no-copy views



In [228]:
r = np.ones((4,4))

In [229]:
r

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [230]:
r2 = r[:2, :2 ]

In [231]:
r2

array([[1., 1.],
       [1., 1.]])

In [232]:
r2[:] = 0

In [233]:
r2

array([[0., 0.],
       [0., 0.]])

In [234]:
r

array([[0., 0., 1., 1.],
       [0., 0., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

## Copying Data



In [235]:
r = np.ones((4,4))

In [236]:
r3 = r[:2,:2].copy()

In [237]:
r3

array([[1., 1.],
       [1., 1.]])

In [238]:
r3[:] = 2578

In [239]:
r3

array([[2578., 2578.],
       [2578., 2578.]])

In [240]:
r

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
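
Whether a subarray is a view or a copy can also be checked directly with `np.shares_memory`, without modifying anything. A minimal sketch (the names `view`, `copied` and `fancy` are hypothetical):

```python
import numpy as np

r = np.ones((4, 4))

view = r[:2, :2]             # basic slicing returns a view
copied = r[:2, :2].copy()    # .copy() allocates new memory
fancy = r[[0, 1]]            # indexing with a list of integers returns a copy

view_shares = np.shares_memory(r, view)
copy_shares = np.shares_memory(r, copied)
fancy_shares = np.shares_memory(r, fancy)

print(view_shares, copy_shares, fancy_shares)
```
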

## Example: Which is the most popular airport?

In [242]:
jfk = taxi[taxi[:, 6] == 2]
jfk.shape[0]

11832

In [243]:
laguardia = taxi[taxi[:, 6] == 3]
laguardia.shape[0]

16602

In [244]:
newark = taxi[taxi[:, 6] == 5]
newark.shape[0]

63

## Example: Calculating Statistics for Trips on Clean Data

In [245]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

In [246]:
cleaned_taxi = taxi[trip_mph < 100]

In [247]:
# mean_distance
cleaned_taxi[:, 7].mean()

12.666396599932893

In [249]:
# mean_length in minutes
cleaned_taxi[:, 8].mean() / 60

37.325060955150434

In [250]:
# mean_total_amount
cleaned_taxi[:, 13].mean()

48.98131853260262