# Day 6

* [NumPy](#numpy)
* [Working with Files](#working-with-files)
* [Preparing Data with NumPy](#preparing-data-with-numpy)

## NumPy

Today we will learn how to manipulate numerical data from a dataset using the `numpy` module. NumPy stands for Numerical Python.

Working with a list of lists generally requires nested loops. With the `numpy` module, we can instead convert the list of lists to a NumPy `array`, which makes calculations on the data easier.

To use the `numpy` module, we need to first import it. 
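
As a minimal sketch of the difference, the example below sums each column of a small list of lists, first with nested loops and then with a NumPy array. The variable names and values are made up for illustration and are not part of the dataset used later.

```python
import numpy as np

# Hypothetical data: 3 rows, 2 columns
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# With plain lists, column sums need nested loops
column_sums = [0.0, 0.0]
for row in data:
    for j in range(len(row)):
        column_sums[j] += row[j]

# With NumPy, convert once and let the array do the work
arr = np.array(data)
numpy_sums = arr.sum(axis=0)  # axis=0 sums down each column

print(column_sums)
print(numpy_sums)
```

Both approaches give the same totals; the NumPy version simply replaces the loops with one call.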

In [None]:
# import numpy with the alias np


### Creating 1D arrays

A one-dimensional array is also called a vector. We can create 1D arrays using the `np.zeros()`, `np.ones()`, and `np.random.rand()` functions.

We only need a single index to retrieve an element from a 1D array.  
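
The following sketch shows these three functions and single-index access on small vectors. The variable names (`zeros_vec`, `ones_vec`, `random_vec`) are illustrative only and differ from the ones requested in the exercise cells.

```python
import numpy as np

zeros_vec = np.zeros(3)         # 3 elements, all 0.0
ones_vec = np.ones(5)           # 5 elements, all 1.0
random_vec = np.random.rand(3)  # 3 random values in the range [0.0, 1.0)

print(zeros_vec.shape)     # the shape of a 1D array is a one-element tuple
print(ones_vec[0])         # a single index retrieves one element
print(random_vec[-1])      # negative indices count from the end
print(zeros_vec.tolist())  # convert the array back to a Python list
```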

In [None]:
# Creating 1D arrays 
# A 1D array is also called a vector 

# Create a variable arr1 that is a 1D array of 3 elements, all zeros. 

# Create a variable arr2 that is a 1D array of 10 elements, all ones. 

# Create a variable arr3 that is a 1D array of 5 elements, all ones.

# Create a variable arr4 that is a 1D array of 3 elements, all random values.

# Output arr1, arr2, arr3, arr4


In [None]:
# Output the shape of each of the 1D arrays 


In [None]:
# Can convert 1D arrays back to lists 


In [None]:
# Selecting a row from a 2D numpy array will yield a 1D array 


In [None]:
# Selecting a column from a 2D numpy array will also yield a 1D array 


In [None]:
# We can fetch individual elements from row_zero and col_zero


### Useful Functions in NumPy
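
As a rough illustration of the functions practiced in this section, the sketch below computes the sum, maximum, minimum, and standard deviation of one column. The array `sample` is a made-up stand-in with fixed values so the results are easy to check by hand.

```python
import numpy as np

# Hypothetical 3x2 array with fixed values
sample = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

first_col = sample[:, 0]  # all rows, column 0

col_sum = first_col.sum()
col_max = first_col.max()
col_min = first_col.min()
col_std = first_col.std()  # population standard deviation by default (ddof=0)

print(col_sum, col_max, col_min, col_std)
```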

In [None]:
# Print random_array again to refer to the values 


In [None]:
# Sum of all values in first column 


In [None]:
# Max value in first column 


In [None]:
# Min value in first column 


In [None]:
# Standard deviation in first column 


### Important Details about NumPy 

Documentation: https://numpy.org/doc/stable/

1. NumPy arrays (numpy.ndarray) have a fixed size. Unlike Python lists, we cannot append things to them. Changing the size of an ndarray creates a new array and deletes the original.
2. We can change the values of items. 
3. The elements in a NumPy array are all required to be of the same data type (again, a point of difference from lists)
4. Arrays can be 1D, 2D, or multi-dimensional.
5. NumPy stores values using its own data types and these map to the Python datatypes we have covered so far. 
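
The points above can be sketched with a few small arrays. This is an illustrative example with made-up values; `np.append` is used here only to show that resizing produces a new array rather than growing the original.

```python
import numpy as np

arr = np.array([1, 2, 3])

# Fixed size: np.append returns a new array and leaves arr unchanged
bigger = np.append(arr, 4)
print(arr.shape, bigger.shape)

# Values of existing items can be changed in place
arr[0] = 10
print(arr)

# One data type per array: mixing integers and a float gives an array of floats
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# Elements use NumPy's own types, which convert to plain Python values
element = mixed[0]
print(type(element))
print(type(element.item()))
```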

### 2D Arrays

These are also known as "matrices". A matrix has rows and columns. 
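
As a small sketch of how rows and columns are addressed, the example below indexes a made-up 3x4 matrix with `[row, column]` pairs and slices. The array `matrix` is hypothetical and uses fixed values so each result can be checked by eye.

```python
import numpy as np

# Hypothetical matrix with 3 rows and 4 columns
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

print(matrix.shape)    # (number of rows, number of columns)
print(matrix[0, 0])    # first row, first column
print(matrix[-1, -1])  # last row, last column
print(matrix[1, 2])    # second row, third column
print(matrix[2, 1:3])  # second and third items of the third row
print(matrix[:, 3])    # entire fourth column
print(matrix[1, :])    # entire second row
```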

In [None]:
# Creating an empty array with 3 rows and 4 columns
# Specify the shape of the array i.e. number of rows and columns  
# The shape is passed as a tuple, which is like a list but immutable


We can create an array filled with zeros and update it later as our calculations require. 

Similarly, we can create an array of ones as well as an array of random numbers. 
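
A minimal sketch of the three creation functions for 2D arrays follows. The variable names are illustrative. Note that `np.zeros()` and `np.ones()` take the shape as a tuple, while `np.random.rand()` takes the dimensions as separate arguments.

```python
import numpy as np

zeros_2d = np.zeros((3, 4))       # 3 rows, 4 columns, all 0.0
ones_2d = np.ones((5, 5))         # 5 rows, 5 columns, all 1.0
random_2d = np.random.rand(3, 4)  # 3 rows, 4 columns, values in [0.0, 1.0)

# An array of zeros can be filled in later
zeros_2d[1, 2] = 7.5

print(zeros_2d)
print(ones_2d.shape)
print(random_2d.shape)
```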

In [None]:
# Creating an array of ones with 5 rows and 5 columns


In [None]:
# Creating an array of 3 rows and 4 columns where each element is a random number 
# All random numbers would be between 0.0 and 1.0 
# Use the np.random.rand() function to generate the random numbers


In [None]:
# Indexing the random_array
# Getting the item in first row and first column 


In [None]:
# Getting the item in last row and last column 


In [None]:
# Getting the item in 2nd row and 3rd column 


In [None]:
# Changing the value of the element in second row and third column


In [None]:
# Getting multiple items
# Getting the first two items from the fourth column 


In [None]:
# Getting the first two items from the third row 


In [None]:
# Getting the second and the third items from the third row 
# The slice x:y gets the values at indices x through y-1


In [None]:
# Selecting the entire fourth column 


In [None]:
# Selecting the entire second row 


In [None]:
# Can select the entire array using the slice operator 


In [None]:
# Getting the datatype of any element


## Working with Files

We will now learn how to load data from external sources, namely from files.

It is advisable to keep the files in the same folder as your Jupyter Notebook. 

To open files in Python, we use the built-in function `open()`.

The `open()` function requires the name of the file as the first argument, which identifies the file to open. The other arguments are optional, but a second argument is usually provided to specify the mode in which to open the file. The most commonly used modes are:

1. `'r'`: read-only mode, for reading data from the file. This is the default value of the mode argument.
2. `'w'`: write mode, which overwrites any existing data and writes new data to the file.
3. `'a'`: append mode, which writes new data after the existing data.

Further documentation on arguments of the `open()` function and various file modes: https://docs.python.org/3/library/functions.html#open
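
The three modes can be sketched as follows. To avoid touching any real file, this example writes to a hypothetical `example.txt` inside a temporary folder created with the standard `tempfile` module; the file name and contents are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")

    # 'w' creates the file, or overwrites it if it already exists
    with open(path, "w") as f:
        f.write("first line\n")

    # 'a' keeps the existing data and writes after it
    with open(path, "a") as f:
        f.write("second line\n")

    # 'r' is the default mode, so open(path) would work too
    with open(path, "r") as f:
        contents = f.read()

print(contents)
```

The `with` statement closes each file automatically when its block ends.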

In [None]:
# We will use data on red wine, available from UCI Machine Learning Repository 
# Link to the full dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
# Tutorial reference on NumPy: https://www.dataquest.io/blog/numpy-tutorial-python/

# Let's assign the name of the file ("day6-winequality-red.csv") to a variable filename 
# Since filenames are text, it has to be assigned to the variable as a string 



In [None]:
# Since it is a CSV file, let's use the CSV module to quickly read data from it. 
# Documentation: https://docs.python.org/3/library/csv.html
# All we need is the `reader` function, which takes a file object as input and returns a reader object. Iterating over the reader yields each line of the file as a list of strings.


In [None]:
# Opening the file using the open() function 
# use the csv.reader with the correct delimiter (",") to read the entire file into a variable.

# Let's print the datatype of the output 


In [None]:
# this is equivalent
# open file without `with` statement
# csvreader is a sequence of lists so we can use `for` to iterate over it
# remember to close the file if you don't use the `with` keyword!


In [None]:
# let's print all lines


Each item in wines also seems like a list. 
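
To see why each item is a list, here is a small sketch of `csv.reader` at work. It reads from made-up CSV text wrapped in `io.StringIO` instead of a file on disk, so the column names and values are hypothetical.

```python
import csv
import io

# Hypothetical CSV text standing in for a file on disk
csv_text = "name,score\nalpha,1.5\nbeta,2.5\n"

with io.StringIO(csv_text) as f:
    reader = csv.reader(f, delimiter=",")
    rows = list(reader)  # collect every row into a list of lists

print(type(rows))
print(rows[0])    # the header row
print(rows[1])    # a data row; note that the numbers are still strings
print(len(rows))  # number of rows, including the header
```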

In [None]:
# Output number of items in the list 
# This also corresponds to the number of rows in the CSV file 


In [None]:
# Select a random item from wines. Output its datatype as well as the item. 

# this will generate a random index between 0 and the length of the list minus 1

# let's print the random item


In [None]:
# Output the number of items in a random list within wines 
# This also corresponds to the number of columns in the CSV file


## Preparing Data with NumPy

The numpy module has an `array` function that can be used to convert lists to arrays. 

Since the above variable, wines, is a list of lists, it would be converted to a two-dimensional (2D) array. The resulting array would have 1599 rows (excluding the header row) and 12 columns.

However, we want to leave out the first item of `wines`, which contains the header information. To do so, we can use the *slice* operator (`:`), which works on lists and strings as well as arrays. 
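
The idea can be sketched on a tiny stand-in for `wines`. The list `rows` and its values below are made up; only the pattern (slice off the header, then convert with an explicit `dtype`) is the point.

```python
import numpy as np

# Hypothetical list of lists: one header row followed by data rows of strings
rows = [["col_a", "col_b"],
        ["7.4", "5"],
        ["7.8", "6"]]

# rows[1:] keeps everything from index 1 onward, dropping the header
rows_without_header = rows[1:]

# dtype=float converts the strings to floating-point numbers
arr = np.array(rows_without_header, dtype=float)

print(rows_without_header)
print(arr)
print(arr.dtype)
```

Specifying `dtype=float` matters here because values read from a CSV file arrive as strings.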

In [None]:
# Using the slice operator to get rid of the first item (header row) in wines
# Let's get rid of the first row of wines and store the new list of lists in a variable wines_without_header

# Index 0 corresponds to row 1
# Index 1 corresponds to row 2 


In [None]:
# Creating a numpy array using wines_without_header
# it's good practice to specify the `dtype` (especially when loading external sources)


In [None]:
# Output the datatype of wines_array


We can check the number of rows and columns in the array using the shape **property** of numpy arrays. 

A property is different from a function. When using a **property**, you do not need to follow it with parentheses or specify any argument. 
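
A short sketch of the difference, using a made-up array of zeros: `shape` is read without parentheses, while a function attached to the array, such as `sum()`, is called with them.

```python
import numpy as np

arr = np.zeros((3, 4))

# shape is a property: no parentheses, no arguments
print(arr.shape)

# The shape tuple can be unpacked into rows and columns
n_rows, n_cols = arr.shape
print(n_rows, n_cols)

# sum() is a function, so it needs parentheses
total = arr.sum()
print(total)
```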

In [None]:
# Getting the number of rows and columns in wines_array


### Lecture Practice (20 minutes) 

1. Print the item from the 5th row and 9th column of `wines_array` 
2. Print the item from the 800th row and 12th column of `wines_array`
3. Print the entire last column of `wines_array`
4. Print the entire last row of `wines_array`
5. Print the entire first row of `wines_array`
6. Print the entire first column of `wines_array`
7. Print the first 10 items of the 5th column of `wines_array`
8. Print the first item of `wines` (NOTE: `wines` is the list of lists we created initially and used for creating `wines_array`).
9. Get the first 3 items of the 3rd row of `wines_array`. Which column headers do they correspond to? Use the answer from question 8. 
10. What does the fourth column of `wines_array` indicate about the red wine dataset? 

In [None]:
# Solutions 

#1

#2

#3

#4

#5

#6

#7

#8

#9 Fixed Acidity

#10 


### Lecture Practice (15 minutes) 

1. Discuss the code in the cell below. Refer to the printed output above to identify the index of each column for questions 2-5.

In [None]:
# For lecture practice question 1, run the code below
header_row = wines[0]

for i in range(0, len(header_row)):
    print("Column {}: {}".format(i,header_row[i]))



2. Calculate the mean density of all wines. Also, calculate maximum density and minimum density.

3. Calculate the mean quality rating of all wines. Also, calculate the maximum quality rating and minimum quality rating.

4. Calculate the mean and standard deviation of fixed acidity for all wines.

5. Calculate the mean and standard deviation of alcohol content for all wines. 

In [None]:
# Practice problem 2


In [None]:
# Practice problem 3 


In [None]:
# Practice problem 4


In [None]:
# Practice problem 5
