Reference: https://www.dataquest.io/blog/numpy-tutorial-python/#using-numpy-to-read-in-files

**NumPy** is a commonly used Python data analysis package. By using NumPy, you can speed up your workflow, and interface with other packages in the Python ecosystem, like *scikit-learn*, that use NumPy under the hood.

In this tutorial, we'll walk through using NumPy to analyze data on **wine quality**. The data contains information on various attributes of wines, such as *pH* and *fixed acidity*, along with a quality score between 0 and 10 for each wine. The quality score is the average of the ratings given by at least 3 human taste testers. As we learn how to work with NumPy, we'll try to figure out more about the perceived quality of wine.

### Data
The wines we'll be analyzing are from the Minho region of Portugal.

The data was downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php), and is available [here](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

We will use the **red wine** data: *winequality-red.csv*

In [None]:
import csv

In [None]:
# open the csv file, use ";" as delimiter to split the records
# get all the rows from the file and assign the results to wines
with open("winequality-red.csv", 'r') as f:
    wines = list(csv.reader(f, delimiter=";"))
    
# print out the first 2 rows    
print(wines[:2])

The data has been read into a list of lists. The first row contains column headers. Each row after the header row represents a wine.

### NumPy 2-Dimensional Arrays
With NumPy, we work with multidimensional arrays. For now, we'll focus on matrices (2-dimensional arrays).

In [None]:
import numpy as np

In [None]:
# read data
with open("winequality-red.csv", 'r') as f:
    wines = list(csv.reader(f, delimiter=";"))
    
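# eliminate headers (first row) and convert the strings to floats
# (the np.float alias was removed in NumPy 1.24; use the built-in float)
wines = np.array(wines[1:], dtype=float)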
wines

In [None]:
# an alternative way: np.genfromtxt
# specify the keyword argument delimiter = ";"
# also skip the header row
wines = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
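
If you don't have the CSV file at hand, the same call can be tried on an in-memory sample. This is a minimal sketch: the three-column text below is a made-up excerpt in the same `;`-delimited layout, and `sample_wines` is a hypothetical name, not part of the tutorial's data.

```python
import io

import numpy as np

# Hypothetical three-column sample using the same ";"-delimited layout
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.4;3.51;5\n"
    "7.8;3.2;5\n"
)

# skip_header=1 drops the header row; the remaining fields are parsed as floats
sample_wines = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(sample_wines.shape)  # (2, 3)
print(sample_wines.dtype)  # float64
```

Because `genfromtxt` parses numbers directly, no separate conversion from strings is needed, unlike the `csv.reader` approach.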

In [None]:
# check the number of rows and columns 
wines.shape

You can also create NumPy arrays filled with **zeros** or with **random** numbers. 

It's useful to create an array with all zero elements in cases when you need an array of fixed size, but don't have any values for it yet.

Creating arrays full of random numbers can be useful when you want to quickly test your code with sample arrays.

In [None]:
# example 1 - a placeholder array filled with zeros
empty_array = np.zeros((3,4))
empty_array

In [None]:
# example 2 - a random array
np.random.rand(3,4)
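
When you want the random test data to be repeatable between runs, one option is a seeded generator. The sketch below uses `np.random.default_rng`, which is NumPy's newer random API and is not what the cells above use; the seed value `0` is an arbitrary choice.

```python
import numpy as np

# A 3 x 4 array of zeros; the default dtype is float64
zeros = np.zeros((3, 4))
print(zeros.dtype)

# A seeded generator produces the same "random" values on every run
rng = np.random.default_rng(0)
random_values = rng.random((3, 4))  # uniform samples in [0, 1)
print(random_values.shape)          # (3, 4)
```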

### Indexing NumPy Arrays

We can use array indexing to select individual elements, groups of elements, or entire rows and columns. One important thing to keep in mind is that just like Python lists, NumPy is **zero-indexed**. The indices of the first row and first column are 0.

In [None]:
wines[:1]

In [None]:
wines[0, 3]

### Slicing NumPy Arrays
A colon indicates that we want to select all the elements from the starting index up to but not including the ending index.

In [None]:
wines[0, :3]

We can select an entire column, from the first to the last, by just using the colon (:), with no starting or ending indices. 

In [None]:
# this will select the entire fourth column:
wines[: , 3]
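
The indexing and slicing patterns above are easier to check on a small array whose values you can predict. This is an illustrative sketch; `grid` is a hypothetical stand-in for *wines*.

```python
import numpy as np

# A small 3 x 4 array:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

element = grid[0, 3]       # row 0, column 3 -> 3
first_three = grid[0, :3]  # first three values of row 0 -> [0 1 2]
fourth_col = grid[:, 3]    # every row, column 3 -> [ 3  7 11]

# A slice keeps the dimension, a single index drops it
print(grid[:1].shape)  # (1, 4)
print(grid[0].shape)   # (4,)
```

Note that `grid[:1]` is still 2-dimensional (one row), while `grid[0]` is a 1-dimensional array.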

### Assigning Values

Indexing can also be used to assign values to certain elements in arrays.

In [None]:
print (wines[1, 5])
wines[1, 5] = 10
print (wines[1, 5])

wines[1, 5] = 25  # restore the original value

### N-Dimensional NumPy Arrays

One way to think of an n-dimensional NumPy array is as **a list of lists of lists**. Let's say we want to store the monthly earnings of a store, but we want to be able to quickly look up the results for a quarter, and for a year. The earnings for one year might look like this:

In [None]:
# year one earnings

year_one = [
    [500,505,490],
    [810,450,678],
    [234,897,430],
    [560,1023,640]
]

print ("First quarter earning is:", year_one[0],
       "\nJanuary earning is:", year_one[0][0])

In [None]:
# now add another year of earnings
earnings = [
            [
                [500,505,490],
                [810,450,678],
                [234,897,430],
                [560,1023,640]
            ],
            [
                [600,605,490],
                [345,900,1000],
                [780,730,710],
                [670,540,324]
            ]
          ]

We now need three indices to retrieve an element.

In [None]:
earnings = np.array(earnings)

print ("January earning is:", earnings[0][0][0])
earnings.shape
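
Once the nested list is a NumPy array, the three indices can also be written as a single comma-separated index, and slices can be mixed in. A short sketch using the same earnings figures (the variable names are illustrative):

```python
import numpy as np

# Axes: (year, quarter, month within the quarter)
earnings = np.array([
    [[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]],
    [[600, 605, 490], [345, 900, 1000], [780, 730, 710], [670, 540, 324]],
])
print(earnings.shape)  # (2, 4, 3)

january_year_one = earnings[0, 0, 0]     # 500
first_quarter_year_one = earnings[0, 0]  # [500 505 490]
january_both_years = earnings[:, 0, 0]   # [500 600]

# Total first-quarter earnings for each year
first_quarter_totals = earnings[:, 0, :].sum(axis=1)  # [1495 1695]
```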

### NumPy Data Types

You can find the data type of a NumPy array by accessing the **dtype** property:

In [None]:
wines.dtype

You can find a full list of data types here: https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

And here are some important ones:
* *float* -- numeric floating point data.
* *int* -- integer data.
* *string* -- character data.
* *object* -- Python objects.

### Converting Data Types

You can use the **numpy.ndarray.astype** method to convert an array to a different type. The method will actually copy the array, and return a new array with the specified data type. For instance, we can convert wines to the int data type:

In [None]:
wines.astype(int)

In [None]:
wines.astype(int).dtype
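
Two details of `astype` are worth checking on a small example: converting floats to integers drops the fractional part rather than rounding, and the original array is left untouched. A minimal sketch with made-up values:

```python
import numpy as np

values = np.array([7.4, 3.51, -1.9])

# astype returns a new array; fractional parts are truncated toward zero
as_int = values.astype(int)
print(as_int)        # [ 7  3 -1]

# The original array keeps its float data
print(values.dtype)  # float64
```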

### Single Array Math

If you do any of the basic mathematical operations with an array and a value, it will apply the operation to each of the elements in the array.

In [None]:
wines[:,11] + 10

In [None]:
# notice the difference?
wines[:,11] + 10
print (wines[:, 11])

wines[:, 11] += 10
print (wines[:, 11])

In [None]:
wines[:, 11] -= 10
print (wines[:, 11])

In [None]:
wines[:,11] * 2
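
The difference between `+` and `+=` can be seen on a small array. In this sketch (`table` is a hypothetical two-column array), `+` builds a new array, while `+=` on a column slice writes back into the original array:

```python
import numpy as np

table = np.array([[1.0, 5.0],
                  [2.0, 6.0]])

# "+" returns a new array; table is unchanged
result = table[:, 1] + 10
print(result)       # [15. 16.]
print(table[:, 1])  # [5. 6.]

# "+=" modifies the selected column of table in place
table[:, 1] += 10
print(table[:, 1])  # [15. 16.]
```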

### Multiple Array Math
It's also possible to do mathematical operations between arrays. This will apply the operation to pairs of elements.

In [None]:
wines[:,11]

In [None]:
wines[:,11] + wines[:,11]

In [None]:
wines[:,11] * wines[:,11]

### Broadcasting

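Elementwise operations require the arrays' shapes to be compatible. When the shapes are not exactly the same, NumPy tries broadcasting to match up elements: it compares the shapes dimension by dimension, starting from the last one, and two dimensions are compatible when they are equal or when one of them is 1. If the shapes can't be matched this way, NumPy raises an error.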

In [None]:
array_one = np.array(
    [
        [1,2],
        [3,4]
    ]
)
array_two = np.array([4,5])

array_one + array_two

As you can see, *array_two* has been broadcast across each row of *array_one*. Here's an example with our wines data:

In [None]:
rand_array = np.random.rand(12)
wines + rand_array

Elements of *rand_array* are broadcast over each row of *wines*, so the first column of *wines* has the first value in *rand_array* added to it, and so on.
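
To see when broadcasting works and when it doesn't, here is a small sketch with made-up shapes. A `(3, 4)` array combines with a `(4,)` row or a `(3, 1)` column, but not with a `(3,)` array, because the trailing dimensions (4 and 3) don't match:

```python
import numpy as np

matrix = np.ones((3, 4))
row = np.arange(4)                   # shape (4,)
column = np.arange(3).reshape(3, 1)  # shape (3, 1)

print((matrix + row).shape)     # (3, 4): row is added to every row
print((matrix + column).shape)  # (3, 4): column is added to every column

# Shapes (3, 4) and (3,) are incompatible, so NumPy raises a ValueError
broadcast_failed = False
try:
    matrix + np.arange(3)
except ValueError as error:
    broadcast_failed = True
    print("cannot broadcast:", error)
```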

### NumPy Array Methods

NumPy also has several methods that you can use for more complex calculations on arrays.

In [None]:
wines[:,11].sum()

If we call *sum* across the *wines* matrix, and pass in *axis=0*, we'll find the sums over the first axis of the array. This will give us the sum of all the values in every column.

In [None]:
wines.sum(axis=0)

In [None]:
# check number of columns
wines.sum(axis=0).shape

If we pass in *axis=1*, we'll find the sums over the second axis of the array. This will give us the sum of each row.

In [None]:
wines.sum(axis=1)
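
The `axis` argument is easiest to verify on a tiny array where the sums can be done by hand. A minimal sketch (`table` is an illustrative name):

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

total = table.sum()             # all elements -> 21
column_sums = table.sum(axis=0) # collapse the rows -> [5 7 9]
row_sums = table.sum(axis=1)    # collapse the columns -> [ 6 15]

print(total, column_sums, row_sums)
```

One way to remember it: the axis you pass is the one that disappears from the result's shape.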

There are several other methods that behave like the sum method, including:

* *numpy.ndarray.mean* — finds the mean of an array.
* *numpy.ndarray.std* — finds the standard deviation of an array.
* *numpy.ndarray.min* — finds the minimum value in an array.
* *numpy.ndarray.max* — finds the maximum value in an array.

You can find a full list of array methods here: https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
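
These methods accept the same `axis` argument as `sum`. A short sketch on made-up scores; note that `std` computes the population standard deviation by default (`ddof=0`):

```python
import numpy as np

scores = np.array([[5.0, 7.0],
                   [6.0, 8.0]])

overall_mean = scores.mean()       # 6.5
column_means = scores.mean(axis=0) # [5.5 7.5]
row_minimums = scores.min(axis=1)  # [5. 6.]
highest = scores.max()             # 8.0
spread = scores.std()              # about 1.118

print(overall_mean, column_means, row_minimums, highest, spread)
```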

### NumPy Array Comparisons

NumPy makes it possible to test to see if rows match certain values using mathematical comparison operations like <, >, >=, <=, and ==. For example, if we want to see which wines have a quality rating higher than 5, we can do this:

In [None]:
wines[:,11] > 5

In [None]:
wines[:,11] == 10

### Subsetting

One of the powerful things we can do with a Boolean array and a NumPy array is select only certain rows or columns in the NumPy array. For example, the code below will only select rows in wines where the quality is over 7:

In [None]:
high_quality = wines[:,11] > 7
wines[high_quality,:][:3,:].astype(float)

We select only the rows where *high_quality* contains a *True* value, and all of the columns. This subsetting makes it simple to filter arrays for certain criteria. For example, we can look for wines with a lot of alcohol and high quality. In order to specify multiple conditions, we have to place each condition in parentheses, and separate conditions with an ampersand (&):

In [None]:
high_quality_and_alcohol = (wines[:,10] > 10) & (wines[:,11] > 7)
wines[high_quality_and_alcohol,10:]
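
The same pattern on a small array makes the masks visible. In this sketch the two columns are hypothetical alcohol and quality values, not rows from the dataset. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>`:

```python
import numpy as np

# Hypothetical columns: alcohol, quality
sample = np.array([[9.4, 5.0],
                   [12.8, 8.0],
                   [9.8, 8.0],
                   [11.0, 6.0]])

strong = sample[:, 0] > 10       # [False  True False  True]
high_quality = sample[:, 1] > 7  # [False  True  True False]

both = sample[strong & high_quality]    # rows matching both conditions
either = sample[strong | high_quality]  # rows matching at least one

# True counts as 1, so sum() counts the matching rows
print(high_quality.sum())  # 2
print(both)                # [[12.8  8. ]]
print(either.shape)        # (3, 2)
```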

### Reshaping NumPy Arrays

We can change the shape of arrays while still preserving all of their elements. This often can make it easier to access array elements. The simplest reshaping is to flip the axes, so rows become columns, and vice versa. We can accomplish this with the *numpy.transpose* function:

In [None]:
print (wines.shape)
np.transpose(wines).shape

We can use the *numpy.ravel* function to turn an array into a one-dimensional representation. It will essentially flatten an array into a long sequence of values:

In [None]:
array_one = np.array(
    [
        [1, 2, 3, 4], 
        [5, 6, 7, 8]
    ]
)

array_one.ravel()

For our wine dataset:

In [None]:
print (wines[:2, :])

In [None]:
print (wines[:2, :].ravel())

Finally, we can use the *numpy.reshape* function to reshape an array to a certain shape we specify. The code below will turn the second row of *wines* into a 2-dimensional array with **2** rows and **6** columns:

In [None]:
wines[1,:]

In [None]:
wines[1,:].reshape((2,6))
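
A reshape only succeeds when the new shape holds exactly the same number of elements. As a small sketch with a 12-element array (the names are illustrative), you can also pass `-1` for one dimension and let NumPy work it out:

```python
import numpy as np

row = np.arange(12)  # 12 elements: 0 .. 11

two_by_six = row.reshape((2, 6))
print(two_by_six[1, 0])  # 6, the first value of the second row

# -1 asks NumPy to infer that dimension: 12 / 3 = 4 columns
inferred = row.reshape((3, -1))
print(inferred.shape)    # (3, 4)

# Transposing swaps the axes
print(two_by_six.T.shape)  # (6, 2)

# 5 x 3 = 15 elements doesn't match 12, so this raises a ValueError
reshape_failed = False
try:
    row.reshape((5, 3))
except ValueError:
    reshape_failed = True
```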

### Combining NumPy Arrays

With NumPy, it's very common to combine multiple arrays into a single unified array. We can use *numpy.vstack* to vertically stack multiple arrays. Think of it like the second array's items being added as new rows to the first array.

In [None]:
wines.shape

In [None]:
wines2 = np.vstack((wines, wines))
wines2.shape

In [None]:
print(wines2[0])
print(wines2[1599])

If we want to combine arrays **horizontally**, where the number of rows stays constant, but the columns are joined, then we can use the *numpy.hstack* function. The arrays we combine need to have the same number of rows for this to work.

Finally, we can use *numpy.concatenate* as a general purpose version of *hstack* and *vstack*. If we want to concatenate two arrays, we pass them into *concatenate*, then specify the *axis* keyword argument that we want to concatenate along. Concatenating along the first axis is similar to *vstack*, and concatenating along the second axis is similar to *hstack*:

In [None]:
wines3 = np.concatenate((wines, wines), axis=0)
wines3

In [None]:
wines3.shape
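
For the horizontal case, here is a small sketch with two made-up arrays that share the same number of rows. It shows *hstack* and the equivalent *concatenate* call along the second axis:

```python
import numpy as np

left = np.array([[1, 2],
                 [3, 4]])
right = np.array([[5],
                  [6]])

# Join columns: both arrays have 2 rows
side_by_side = np.hstack((left, right))
print(side_by_side)  # [[1 2 5]
                     #  [3 4 6]]

# concatenate along axis=1 gives the same result
same = np.concatenate((left, right), axis=1)
print(np.array_equal(side_by_side, same))  # True

# For comparison, vstack adds rows instead
stacked = np.vstack((left, left))
print(stacked.shape)  # (4, 2)
```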

### A NumPy Array Challenge!

Now that you understand NumPy arrays a bit, let's see if you can complete a challenge. 

* Create a 3 x 4 array filled with all zeros, and a 6 x 4 array filled with all 1s.

* Concatenate both arrays vertically into a 9 x 4 array, with the all zeros array on top.

* Assign the entire first column of the combined array to first_column.

### Hint

* *np.zeros*
* *vstack*
* Slicing