# Data Ingestion: Tabular Data using NumPy

**What is NumPy**

NumPy (Numerical Python) is an open source Python library that allows fast and efficient operations on numerical data. It is a dependency for many Data Science and Machine Learning libraries such as Pandas, SciPy (Scientific Python), Scikit-Learn, and Matplotlib (for data visualization). Click [here to learn more about numpy](https://numpy.org/doc/stable/user/whatisnumpy.html). It has many built-in functions that allow us to easily import numerical data and perform various transformations on it.


This notebook will briefly introduce: 1) the NumPy array and basic operations that can be performed, and 2) how to load tabular (numerical) data into a numpy array.

The outline is as follows:

1. <a href='#1.-Intro-to-NumPy'>Intro to NumPy</a>
   1. <a href='#1.1-How-to-Create-an-Array'>How to Create an Array</a>
   2. <a href='#1.2-Reshaping-an-Array'>Reshaping an Array</a>
   3. <a href='#1.3-Indexing-&-Slicing-an-Array'>Indexing & Slicing</a>
   4. <a href='#1.4-Useful-Operations-with-arrays'>Useful Operations with arrays</a>
   5. <a href='#1.5-Filtering-Arrays'>Filtering Arrays</a>
   6. <a href='#1.6-Compare-Numpy-Arrays-with-Lists'>Compare Numpy Arrays with Lists</a>
2. <a href='#2.-Loading-Tabular-(Numeric)-Data'>Loading Tabular (Numeric) Data</a>

## 1. Intro to NumPy

NumPy can be used to ingest data in tabular format and store it in an array. However, before we start to load tabular data, let's review a few basics and learn about numpy arrays and some of the mathematical operations that can be performed on data inside them.

**First we need to import numpy**

In [1]:
#Import libraries
import numpy as np #np is the widely adopted alias for numpy


### 1.1 How to Create an Array
An array is a data structure whose elements all share the same type, i.e. the elements are all integers, or all floats, etc., rather than a mixture of types such as integers and strings.
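
As a rough illustration of what "same type" means in practice, the sketch below shows what NumPy does when it is given mixed values: rather than raising an error, it converts every element to one common type. The variable names are made up for this example.

```python
import numpy as np

# Mixing integers and floats: every element is converted to a float
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)  # float64

# Mixing integers and strings: every element is converted to a string
mixed_text = np.array([1, "two", 3])
print(mixed_text.dtype.kind)  # 'U' indicates a Unicode string type

# The "numbers" are now text, so + concatenates instead of adding
print(mixed_text[0] + mixed_text[2])  # 13
```

This silent conversion is a good reason to check `dtype` after creating an array from existing data.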

There are many ways to create an array, and we will cover the following.

How to create arrays:
- from another python data structure e.g. lists and tuples
- using custom numpy functions e.g. zeros, ones, arange, etc

For more information on the above, and also other ways to create arrays, read the following resource about [array creation in numpy](https://numpy.org/doc/stable/user/basics.creation.html)

#### 1.1.1 Create a NumPy Array from Another Python Data Structure (i.e. Existing Data)
One of the quickest ways to create a numpy array is from an existing Python list. For example, suppose we had the following python list of numbers:

```Python
#List definition and initialization
numbers = [1, 3, 5, 7, 9]
```

We can create the equivalent array as follows:

In [2]:
#numbers = [1, 3, 5, 7, 9]  #this is the list definition

numbers = np.array([1, 3, 5, 7, 9]) #this is the equivalent numpy array for the list above


In [3]:
#we can verify the type using the type function
type(numbers)

numpy.ndarray

Similar to lists, numpy arrays are also zero-indexed and the elements can be accessed using square brackets. For example, let's display the first number:

In [4]:
print(numbers[0])

1


In [5]:
#we can verify the type of the elements in the array using the dtype attribute
numbers.dtype

dtype('int64')

**Dimensions of a NumPy array**

The output above reveals that the variable `numbers` has a type that's called an `ndarray`, which is the abbreviation for **N-dimensional array**. An n-dimensional array is an array with any number of dimensions. It can be one-dimensional i.e. 1-D array, two-dimensional i.e. 2-D array, 3-D or higher dimensional arrays (often called n-dimensional, where **n** is the number of dimensions).

The illustration below demonstrates the difference between 1-D, 2-D and 3-D arrays. The 1-D array contains one row with four elements: 7, 2, 9, and 10. This is similar to a vector. 

The 2-D array contains 2 rows, and each row has three elements. This is similar to a matrix. The number of dimensions and lengths along each dimension can vary, depending on the dataset. However, we can use the `.shape` property of the array to find out its dimensions.


<img src="https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png" width=450></img>
<p style="font-size:8px; text-align:center">Credit: https://www.oreilly.com/library/view/elegant-scipy/9781491922927/ch01.html</p>

In [6]:
#1D array shape
print(numbers.shape)

(5,)


In [7]:
#declare a 2D array and check its shape
multi_numbers = np.array([[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]])
print(multi_numbers)
print(multi_numbers.shape)

[[ 1  3  5  7  9]
 [ 2  4  6  8 10]]
(2, 5)
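
The same idea extends to higher dimensions. The sketch below builds a small 3-D array from nested lists (the name `cube` and its values are invented for illustration) and inspects it with the `ndim`, `shape` and `size` attributes.

```python
import numpy as np

# A hypothetical 3-D array: 2 blocks, each with 2 rows and 2 columns
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])

print(cube.ndim)   # number of dimensions: 3
print(cube.shape)  # length along each dimension: (2, 2, 2)
print(cube.size)   # total number of elements: 8
```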


#### 1.1.2 Create a NumPy Array Using Custom Numpy Functions
Numpy arrays can be created using functions such as: `empty()`, `arange()`, `ones()`, `zeros()`, etc. These functions allow the user to customize the shape of the array and its initial elements. For example:

- `empty()`: create an uninitialized array; its initial values are arbitrary (whatever happens to be in memory)
- `zeros()`: create an array filled with zeros
- `ones()`: create an array filled with ones
- `arange()`: create an array from a range of numbers (similar to the built-in `range()` function)

Here are a few examples below.

In [8]:
#create an uninitialized array with 5 elements (the values shown are arbitrary and may differ on your machine)
np.empty(5)

array([0., 0., 0., 0., 0.])
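
Because `empty()` does not initialize its contents, the zeros shown above are incidental. A minimal sketch of the usual pattern, assuming you intend to overwrite every element before reading the array (the name `scratch` is made up for this example):

```python
import numpy as np

# np.empty only reserves memory; do not rely on the initial values
scratch = np.empty(5)

# Assign every element before using the array
scratch[:] = 7.0
print(scratch)  # [7. 7. 7. 7. 7.]
```

If you need known starting values, `zeros()` or `ones()` is the safer choice.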

In [9]:
#create an array that's initialized with zeros
np.zeros(10) #the array will contain 10 elements


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [10]:
#create an array that's initialized with ones
np.ones(3) #the array will contain 3 elements

array([1., 1., 1.])

In [11]:
#create a 2-d array that's filled with ones
np.ones((2, 4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

By default, `empty()`, `zeros()` and `ones()` create arrays of floating point numbers (**np.float64**). However, you can explicitly specify which data type you want using the **dtype** keyword. For example:

In [12]:
#specifying a data type
np.ones(2, dtype=np.int64)


array([1, 1])
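
The data type can also be changed after an array exists. The sketch below uses the `astype()` method, which returns a converted copy; the variable names are illustrative only.

```python
import numpy as np

float_ones = np.ones(3)                 # float64 by default
int_ones = float_ones.astype(np.int64)  # astype returns a new, converted array

print(float_ones.dtype, int_ones.dtype)  # float64 int64

# Converting floats to integers drops the fractional part (no rounding)
prices = np.array([1.9, 2.5, 3.1])
print(prices.astype(np.int64))  # [1 2 3]
```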

Numpy arrays can be created with a range of elements using `arange()`, which works in a similar manner as the `range()` function. Here are a few examples:

In [13]:
#create an array from a range of numbers between 0 and 8 (excluding 8)
np.arange(8)

array([0, 1, 2, 3, 4, 5, 6, 7])

In [14]:
#create an array of numbers in ascending order, between 1 and 10 (excluding 10)
np.arange(1, 10)


array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
#create an array of even numbers in descending order, between 10 and 1.
np.arange(10, 1, -2)

array([10,  8,  6,  4,  2])
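
To make the comparison with `range()` concrete, the short sketch below checks that both produce the same sequence for integer arguments, and shows one difference: `arange()` also accepts a non-integer step. The step value here is chosen only for illustration.

```python
import numpy as np

# For integer arguments, arange() and range() yield the same numbers
from_range = list(range(10, 1, -2))
from_arange = np.arange(10, 1, -2).tolist()
print(from_range == from_arange)  # True

# Unlike range(), arange() accepts a floating point step
quarters = np.arange(0, 1, 0.25)
print(quarters)  # [0.   0.25 0.5  0.75]
```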

Another helpful function is `np.full()`, which creates an array of a given shape with every element set to the same value. The example below creates a 2-d array with all elements initialized to 50. Since 50 is an integer, the elements get NumPy's default integer type (**np.int64** on most platforms); pass the **dtype** keyword to choose a different type.

In [16]:
#create a 2-d array that's initialized with the same value and type
np.full((2, 3), 50) 

array([[50, 50, 50],
       [50, 50, 50]])

### 1.2 Reshaping an Array

Numpy arrays can be reshaped into a different shape (including a different number of dimensions) without changing the data. However, the new shape must have the same number of elements as the original array.

For example, we previously declared an array called `multi_numbers`, which was a 2x5 array. Let's reshape it:

In [17]:
multi_numbers = multi_numbers.reshape(5,2)
multi_numbers

array([[ 1,  3],
       [ 5,  7],
       [ 9,  2],
       [ 4,  6],
       [ 8, 10]])

As you can see, the elements and their order remain unchanged, but the shape is different.
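
Two further behaviors of `reshape()` are worth seeing in code: passing `-1` lets NumPy infer one dimension, and a shape with the wrong number of elements raises an error. This is a small sketch with an array made up for the purpose.

```python
import numpy as np

values = np.arange(1, 11)  # 10 elements

# -1 asks NumPy to infer that dimension from the total number of elements
two_rows = values.reshape(2, -1)
print(two_rows.shape)  # (2, 5)

# A shape that needs a different number of elements raises a ValueError
try:
    values.reshape(3, 4)  # 12 slots for 10 elements
except ValueError as error:
    print("Cannot reshape:", error)
```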

### 1.3 Indexing & Slicing an Array

Numpy arrays can be sliced and indexed similar to python lists. You can learn more about slicing at the following resource: [Indexing & Slicing](https://numpy.org/doc/stable/user/absolute_beginners.html#indexing-and-slicing).

Here are a few examples:

In [18]:
#slicing 1-d array
numbers[1:3] #1 and 3 are the start and end indices

array([3, 5])

In [19]:
#slicing 2-d array
multi_numbers[1:4, 1] #you can specify start and end indices for each dimension i.e. axis

#PRACTICE: try different values to see how the result changes

array([7, 2, 6])
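
A few more slicing patterns, sketched on an array with the same values as `multi_numbers` after the reshape above (it is redefined here as `grid` so the example runs on its own):

```python
import numpy as np

grid = np.array([[1, 3], [5, 7], [9, 2], [4, 6], [8, 10]])

print(grid[0])       # first row: [1 3]
print(grid[:, 0])    # first column: [1 5 9 4 8]
print(grid[-1, -1])  # last row, last column: 10
print(grid[::2])     # every second row (rows 0, 2 and 4)
```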

### 1.4 Useful Operations with arrays

Numpy has built-in functions that enable basic arithmetic operations on one or more arrays. You can learn more about these operations at the following resource: [useful array operations](https://numpy.org/doc/stable/user/absolute_beginners.html#more-useful-array-operations).

In the example below, an array was created with three elements:

```python
data = np.array([1, 2, 3])
```

We can use methods like `max()`, `min()` and `sum()` as follows:

<p style="font-size:8px; text-align:center"><img src="https://numpy.org/doc/stable/_images/np_aggregation.png" width=450></img>
Source: https://numpy.org/doc/stable/user/absolute_beginners.html#more-useful-array-operations </p>                                                                       


In [20]:
data = np.array([1, 2, 3])
print('The maximum value is: ', data.max())
print('The minimum value is: ', data.min())
print('The total value is: ', data.sum())

The maximum value is:  3
The minimum value is:  1
The total value is:  6


There are other methods, such as: `mean()`, `std()` and `var()` to obtain the average, standard deviation and variance respectively. I encourage you to try them as an exercise.
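
On 2-D arrays, these aggregation methods also accept an `axis` argument, which becomes useful once tabular data is loaded. The sketch below uses a small table invented for illustration.

```python
import numpy as np

# A hypothetical table with 2 rows and 3 columns
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(table.sum())         # total of all elements: 21
print(table.sum(axis=0))   # one total per column: [5 7 9]
print(table.sum(axis=1))   # one total per row: [ 6 15]
print(table.mean(axis=0))  # one average per column: [2.5 3.5 4.5]
```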

### 1.5 Filtering Arrays

Arrays can be filtered to extract values based on a condition. When writing the logic to filter an array, you'll notice a lot of similarity with comparison operators that you previously learned about. For example:

In [21]:
#create an array with the numbers 1 through 9 (arange excludes the stop value 10, so there are 9 elements)
ten_numbers = np.arange(1, 10)

#filter the array to obtain all numbers that are greater than 5
result = ten_numbers[ten_numbers > 5]
print(result)

[6 7 8 9]


In the example above, the condition `ten_numbers > 5` is placed inside square brackets after the name of the array we would like to filter. NumPy evaluates the condition for every element and extracts only the elements that satisfy it.
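
To see what happens inside the square brackets, the sketch below prints the intermediate boolean array and then combines two conditions. Note that NumPy uses `&` (and) and `|` (or) for this, and each condition needs its own parentheses. The variable names are illustrative.

```python
import numpy as np

numbers = np.arange(1, 10)  # 1 through 9

# The condition alone produces an array of True/False values
mask = numbers > 5
print(mask)

# Combine conditions with & (and) or | (or); parentheses are required
middle = numbers[(numbers > 2) & (numbers < 7)]
print(middle)  # [3 4 5 6]

edges = numbers[(numbers < 3) | (numbers > 7)]
print(edges)  # [1 2 8 9]
```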


### 1.6 Compare Numpy Arrays with Lists

Let's look at one example that demonstrates one of the key differences between numpy arrays and python lists. We'll first create a sequence of numbers between 1 and 5 (inclusive), and then multiply each element in the sequence by a scalar (i.e. 2).

In [22]:
#create an array with 5 numbers
array_of_numbers = np.arange(1,6)

#create a list with the same 5 numbers
list_of_numbers = [1, 2, 3, 4, 5]

#multiply each value by 2
array_result = array_of_numbers * 2 
print('The result from an array: ', array_result)


list_result = list_of_numbers * 2
print('The result from a list: ', list_result)

The result from an array:  [ 2  4  6  8 10]
The result from a list:  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]


As you can see from the output above, the numpy array performs an element-wise calculation that multiplies each value in the array by 2. The python list, however, is repeated twice. In order to achieve the same result using a list, we need to adopt a different approach (which *may* involve a loop).

This is one of the reasons that numpy arrays are preferred over python lists, especially when performing numerical operations which can be evaluated using element-wise calculations to avoid the overhead of using a loop.
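
For comparison, here is one possible way to get the element-wise result from a list, using a list comprehension, alongside the array version. The `+` operator shows the same contrast between the two structures.

```python
import numpy as np

list_of_numbers = [1, 2, 3, 4, 5]
array_of_numbers = np.array(list_of_numbers)

# With a list, each element has to be visited explicitly
doubled_list = [n * 2 for n in list_of_numbers]
print(doubled_list)  # [2, 4, 6, 8, 10]

# With an array, the operation applies to every element at once
doubled_array = array_of_numbers * 2
print(doubled_array.tolist() == doubled_list)  # True

# The + operator also differs: lists concatenate, arrays add element-wise
print(list_of_numbers + list_of_numbers)    # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(array_of_numbers + array_of_numbers)  # [ 2  4  6  8 10]
```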

## 2. Loading Tabular (Numeric) Data
NumPy has two standard functions for importing flat files: `numpy.loadtxt()` and `numpy.genfromtxt()`. Click the following link to learn more about [reading and writing files](https://numpy.org/doc/stable/user/how-to-io.html) with numpy.

This example demonstrates how to read a comma-delimited file using the **loadtxt()** function. I encourage you to first open the csv file `gpa_study_hours.csv` and view its contents before running the cell below.

The data in the csv file contains the GPA and study_hours for a sample of 193 undergraduate students at a private US university. You can view the following webpage, for more information about the dataset: [Student GPA and Study Hours Dataset](https://www.openintro.org/data/index.php?data=gpa_study_hours).

Let's load the csv file using the **loadtxt()** function. Seeing that the file is delimited by commas and includes a header, we need to specify the delimiter and skip the header when importing the data.

In [23]:
data = np.loadtxt('gpa_study_hours.csv', #the file name
                  delimiter=',', #indicate the delimiter
                  skiprows=1) #skip the header

#verify the data was imported correctly
print(data[:5,]) #show the first five rows of the data


[[ 4.   10.  ]
 [ 3.8  25.  ]
 [ 3.93 45.  ]
 [ 3.4  10.  ]
 [ 3.2   4.  ]]


As you can see, we successfully imported the data. If necessary, you can verify the type to confirm that it is a numpy array using the type function. 

For example:
```Python
type(data)
```
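
The other loading function mentioned earlier, `genfromtxt()`, is useful when a file has missing values. The sketch below is not based on the real dataset: it reads a few made-up rows from an in-memory string (with assumed column names) so that it runs without any file on disk.

```python
import io
import numpy as np

# Hypothetical CSV text standing in for a file; the second data row
# is missing its study_hours value
csv_text = io.StringIO("gpa,study_hours\n3.5,12\n3.9,\n3.1,8\n")

sample = np.genfromtxt(csv_text,
                       delimiter=',',  # indicate the delimiter
                       skip_header=1)  # skip the header row

print(sample)                  # the missing value is loaded as nan
print(np.isnan(sample).sum())  # number of missing values: 1
```

Note that `genfromtxt()` uses `skip_header` where `loadtxt()` uses `skiprows`.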

Now that the data is in a numpy array, we can learn more about our dataset and ask certain questions like:
- What are the highest and lowest GPAs in this dataset?
- On average, how many hours do students with a 4.0 GPA study?
- etc.

These initial questions help us to understand the dataset and learn more about the sample of students that are being analyzed.

Let's write python code to answer these questions and we will use the built-in methods that numpy provides.

In [24]:
#Get the minimum GPA. The GPA is in the first column, which is index 0
min_gpa = data[:,0].min()
print('The minimum GPA in the dataset is: ', min_gpa)

The minimum GPA in the dataset is:  2.6


In [25]:
#Get the maximum GPA
max_gpa = data[:,0].max()
print('The maximum GPA in the dataset is: ', max_gpa)

The maximum GPA in the dataset is:  4.3


In [26]:
#filter the data to find all students with 4.0 gpa and assign the result to a new variable
gpa = data[data[:,0] == 4.0]

#calculate the average study hours. The study hours is in the second column, which is index 1
avg_study_hours = gpa[:,1].mean()

print('The average study hours for students with a 4.0 gpa is: ', avg_study_hours)

The average study hours for students with a 4.0 gpa is:  15.777777777777779
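
The filter above compares floating point values with `==`, which works here because 4.0 is stored exactly. As a more defensive variant, the sketch below uses `np.isclose()` and also guards against an empty result. The rows are invented for illustration and are not taken from the real dataset.

```python
import numpy as np

# Hypothetical rows in the same (gpa, study_hours) layout
sample = np.array([[4.0, 10.0],
                   [3.8, 25.0],
                   [4.0, 20.0],
                   [3.2,  4.0]])

# np.isclose tolerates tiny floating point differences
is_four = np.isclose(sample[:, 0], 4.0)
top_students = sample[is_four]

# mean() of an empty selection would be nan, so check first
if top_students.size > 0:
    print(top_students[:, 1].mean())  # 15.0
else:
    print('No students matched the filter')
```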
