We give an overview of two Python libraries that are very commonly used in machine learning applications: numpy and pandas.

First, make sure that both libraries are installed in the environment you are working in. You can uncomment (delete the #) and run the following cell to install them for now, but we'd like you to have an environment set up for all of your Data Mine work, so that you do not need to re-install every package you use each time (there will be more than just these two). If you are not familiar with creating conda environments, this is a great task to accomplish during your first lab, so that your TAs can walk you through it step by step. Instructions on how to do so on Anvil can also be found here: https://www.rcac.purdue.edu/knowledge/Anvil/run/examples/apps/python/packages

This process can be tricky if you have never done it before, so don't get discouraged: get someone to help you, and just get it set up. It can be a little tedious, but things are way more fun once this step is done.

If you have already installed pandas and numpy in your environment, you won't need to do it again, so we commented out the next cell.

In [None]:
# pip install numpy
# pip install pandas

Once a library is installed, importing it is all you need to do to use it for the rest of the notebook. We can simply import a package (for example, by typing "import numpy"), or we can import it and give it a shorthand name to refer to it by later on. Usually we shorten numpy to np and pandas to pd. This means that every time we want to reference them later on we can just use np and pd, which saves typing and makes the code easier to read.

In [None]:
import numpy as np
import pandas as pd

Now that we have our libraries available, let's see what we can do with them. We will begin with numpy.

## Numpy

Numpy is used to work with arrays of numbers, which are useful whenever we want to store numerical data. For example, suppose I want to store the average temperature (in °F) in West Lafayette for each day of the past week. We will first store this data as a Python list, and then convert it to a numpy array. As I'm writing this in December, the data may look something like this:

In [None]:
# create a list of values
temperature_list = [29, 32, 25, 30, 34, 37, 41]

# convert the list to a numpy array
temperature_array = np.array(temperature_list)

Having converted our list to an array, we can use numpy's built-in functions to answer questions about our data. What was the maximum temperature recorded? What was the minimum? What was the average over all seven days? Numpy offers many more built-in functions; a list of the mathematical ones can be found here: https://numpy.org/doc/stable/reference/routines.math.html
Most of us don't have all of these memorized, and you are not expected to either, but it is good to get a sense of what's available, so that when you need a function you can look up the precise syntax.

In [None]:
# what's the maximum temperature in the array?
np.max(temperature_array)

In [None]:
# what's the minimum temperature in the array?
np.min(temperature_array)

In [None]:
# what's the average temperature in the array?
np.average(temperature_array)
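As a further illustration of what is available, here is a minimal sketch of a few other built-in functions applied to the same seven temperatures. The variable names warmest_day, coldest_day and median_temp are ours, chosen only for this example.

```python
import numpy as np

# same values as temperature_list above
temperature_array = np.array([29, 32, 25, 30, 34, 37, 41])

# position (index) of the warmest and coldest day; counting starts at 0
warmest_day = np.argmax(temperature_array)
coldest_day = np.argmin(temperature_array)

# the median is the middle value once the temperatures are sorted
median_temp = np.median(temperature_array)

print(warmest_day, coldest_day, median_temp)
```

Note that argmax and argmin return a position in the array rather than a temperature, so they answer "on which day?" instead of "how warm?".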

Referring to the list of functions above, try to answer the following questions yourself, using numpy's built-in functions where useful.

In [None]:
# Q1: what's the sum of all temperatures in the array?

In [None]:
# Q2: given that the formula to convert Fahrenheit to Celsius is
# °C = (°F - 32) × 5/9, and that you can apply a formula to an entire array
# at once, convert the temperature array from °F to °C.
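If applying a formula to an entire array at once is new to you, the sketch below shows the idea on unrelated, made-up data (snowfall in inches converted to centimeters), so that Q2 is still yours to solve. The array snowfall_inches and its values are hypothetical.

```python
import numpy as np

# made-up snowfall measurements in inches, for illustration only
snowfall_inches = np.array([0.0, 1.5, 3.0, 0.5])

# arithmetic on an array is applied to every entry at once,
# so no loop is needed (1 inch = 2.54 cm)
snowfall_cm = snowfall_inches * 2.54

print(snowfall_cm)
```

A plain Python list does not behave this way: multiplying a list by 2.54 raises a TypeError, which is one reason to convert lists to arrays first.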

We can create arrays with multiple rows too. For example, suppose we want to record the average temperatures for the last three weeks. We may use the following array.

In [None]:
temp_3_week = np.array([[40, 45, 43, 37, 32, 35, 38],
                       [41, 43, 44, 40, 37, 33, 31],
                       [29, 32, 25, 30, 34, 37, 41]])

In [None]:
# let's display it
temp_3_week

array([[40, 45, 43, 37, 32, 35, 38],
       [41, 43, 44, 40, 37, 33, 31],
       [29, 32, 25, 30, 34, 37, 41]])

In [None]:
# Q3: use the same method as above to compute maximum, minimum, average,
# sum of all entries and to convert it to Celsius
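Applied to an array with several rows, the functions used so far collapse everything into a single number. As a side note that goes slightly beyond Q3, most of them also accept an axis argument to work row by row or column by column. The following is a minimal sketch; the names weekly_max and daily_min are ours.

```python
import numpy as np

temp_3_week = np.array([[40, 45, 43, 37, 32, 35, 38],
                        [41, 43, 44, 40, 37, 33, 31],
                        [29, 32, 25, 30, 34, 37, 41]])

# axis=1 works along each row: one maximum per week
weekly_max = np.max(temp_3_week, axis=1)

# axis=0 works down each column: one minimum per day of the week
daily_min = np.min(temp_3_week, axis=0)

print(weekly_max)
print(daily_min)
```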

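Arrays have an attribute called shape that tells us how many entries they have along each dimension; for an array with rows and columns, this is the number of rows followed by the number of columns. For example, here are the shapes of the two arrays we've used so far.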

In [None]:
temperature_array.shape

In [None]:
temp_3_week.shape

The total number of elements in an array is its size:

In [None]:
temperature_array.size

In [None]:
temp_3_week.size
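To make the relationship between shape and size concrete, here is a small sketch on two stand-in arrays with the same dimensions as the ones above (their entries are just zeros). The names one_week and three_weeks are hypothetical.

```python
import numpy as np

# stand-ins with the same dimensions as temperature_array and temp_3_week
one_week = np.zeros(7)
three_weeks = np.zeros((3, 7))

# a one-dimensional array has a shape with a single entry
print(one_week.shape, one_week.ndim, one_week.size)

# for two dimensions, size is rows times columns
print(three_weeks.shape, three_weeks.ndim, three_weeks.size)
```

The attribute ndim gives the number of dimensions, which is also the number of entries in shape.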

You may read up here (https://numpy.org/doc/stable/user/quickstart.html) on how to initialize arrays. After reading this page, you should be able to answer the following questions.

In [None]:
# Q4: create an array of zeros with 4 rows and 3 columns

In [None]:
# Q5: create an array of ones with 2 rows and 16 columns

In [None]:
# Q6: create an array with random entries with 6 rows and 2 columns

In [None]:
# Q7: create an array with 4 entries consisting of
# evenly spaced integers between 2 and 8 included

As we conclude our brief overview of numpy, we look at one last feature of this library. It is sometimes convenient to look at arrays of arrays. For example, we might want to consider an object that looks like:

In [None]:
example_array = np.random.rand(2, 3, 4)
example_array

This created two arrays with three rows and four columns each, which together form a single numpy object with shape (2, 3, 4). If we want to look at a particular entry of an array, we can use the following notation to extract it:

In [None]:
# extract the first entry of the temperature_array
# remember that Python starts counting at 0

temperature_array[0]

In [None]:
# extract the temperature of the fourth day of the second week in the
# temp_3_week array

temp_3_week[1][3]

In [None]:
# extract the temperature of the last day of the third week in the
# temp_3_week array

temp_3_week[-1][-1]

In [None]:
# we could also have used
temp_3_week[2][6]
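Numpy also accepts all indices inside a single pair of square brackets, separated by commas, and a colon can stand for "everything along this dimension". The following is a minimal sketch of that notation on the same three-week array; the variable names are ours.

```python
import numpy as np

temp_3_week = np.array([[40, 45, 43, 37, 32, 35, 38],
                        [41, 43, 44, 40, 37, 33, 31],
                        [29, 32, 25, 30, 34, 37, 41]])

# same entry as temp_3_week[1][3]: second week, fourth day
fourth_day_week_two = temp_3_week[1, 3]

# a colon selects every row: the first day of each of the three weeks
first_days = temp_3_week[:, 0]

# a slice selects a range: the first three days of the third week
start_of_week_three = temp_3_week[2, :3]

print(fourth_day_week_two, first_days, start_of_week_three)
```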

In [None]:
# Q8: extract the object in the second array, third row, fourth
# column of the example_array

## Pandas

Sometimes we will want to consider datasets larger than just one array, and that's where it's often convenient to use the pandas library. As a first step in that direction, let us load a famous dataset from a library called sklearn. Again, you may need to install it first:

In [None]:
# pip install -U scikit-learn

In [None]:
# import sklearn
import sklearn

# import the function that loads the dataset we will use, called iris
from sklearn.datasets import load_iris

In [None]:
# now we can call the load_iris() function to import the dataset
# we give it the name iris_data
iris_data = load_iris()

You'll find a description of the iris dataset here: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html

We could have a look at the data contained in it just by doing the following:

In [None]:
# display the data we just imported
iris_data

Take a look at that output. You should find confirmation of what you read on the page we linked above. For example, you should have read that this dataset consists of 150 samples with 5 attributes each. The first four attributes, usually referred to as input features, are numerical measurements of four characteristics of a flower: sepal length, sepal width, petal length, and petal width. The fifth value, usually referred to as the target, classifies the flower measured in that row into one of three possible classes: setosa, versicolor, or virginica. These are encoded as 0 for setosa, 1 for versicolor, and 2 for virginica.

For example, let's consider the first row of the dataset. We use ".data" to extract the four numerical columns and ".target" to extract the target. Then we single out the row we'd like to access using the same square-bracket notation that we encountered above, since data and target are arrays. We'll see the four measurements and a value of 0 for the target, meaning that this flower is a setosa.

In [None]:
print(iris_data.data[0])
print(iris_data.target[0])
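The numerical encoding of the target can be turned back into names with array indexing. The following is a minimal sketch on a few made-up target values rather than on iris_data itself; the names species_names and example_targets are ours, chosen for illustration. (The object returned by load_iris() also carries a target_names attribute holding the same three names.)

```python
import numpy as np

# position 0 is setosa, 1 is versicolor, 2 is virginica,
# matching the encoding described above
species_names = np.array(['setosa', 'versicolor', 'virginica'])

# a few made-up target values, for illustration only
example_targets = np.array([0, 0, 1, 2])

# indexing an array with an array of positions looks up every entry at once
example_species = species_names[example_targets]

print(example_species)
```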

An alternative way to view this dataset is to use pandas. Pandas allows us to store data in a structure called a data frame. Our goal is to create a data frame with five columns, one for each of sepal length, sepal width, petal length, petal width, and target. We begin by combining the arrays for the input features and the target into a single array called all_iris_data. We use the numpy function concatenate to do so. You can find this function's syntax here: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html
If we simply try to concatenate the two arrays, we get an error message telling us that the input arrays don't have the same number of dimensions. Let's try.

In [None]:
all_iris_data = np.concatenate((iris_data.data, iris_data.target), axis = 1)

Let's inspect the dimensions of our arrays as we learnt above.

In [None]:
iris_data.data.shape

In [None]:
iris_data.target.shape
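The same mismatch can be reproduced on a much smaller scale, which makes the shapes easier to compare by eye. This is a sketch on made-up arrays (features and labels are hypothetical names), with the error caught so that the cell keeps running.

```python
import numpy as np

# made-up stand-ins: 3 samples with 2 features each, plus 3 labels
features = np.array([[5.1, 3.5],
                     [4.9, 3.0],
                     [6.3, 3.3]])
labels = np.array([0, 0, 2])

print(features.shape)  # two dimensions: rows and columns
print(labels.shape)    # one dimension only

# joining a 2-D array with a 1-D array along axis 1 raises a ValueError
try:
    np.concatenate((features, labels), axis=1)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```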

It appears that the issue is that the target array is one-dimensional, with shape (150,): it is seen as a plain list of 150 values instead of a 150 x 1 matrix. Luckily, there's an easy fix: we use the numpy function reshape (more on it here: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html). We'll then be able to join the arrays as we were trying to do before.

In [None]:
target_data = np.reshape(iris_data.target, (150, 1))

In [None]:
# let's check that the shape of this new array is the desired one
target_data.shape

In [None]:
# we can now join the two arrays
all_iris_data = np.concatenate((iris_data.data, target_data), axis = 1)

In [None]:
# let's display our new array
all_iris_data
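There are other ways to reach the same result. The sketch below, again on small made-up arrays with hypothetical names, shows two of them: passing -1 to reshape so that numpy works out the number of rows itself, and np.column_stack, which treats a one-dimensional array as a column.

```python
import numpy as np

# made-up stand-ins: 3 samples with 2 features each, plus 3 labels
features = np.array([[5.1, 3.5],
                     [4.9, 3.0],
                     [6.3, 3.3]])
labels = np.array([0, 0, 2])

# -1 means "infer this dimension from the number of entries"
labels_column = labels.reshape(-1, 1)
combined = np.concatenate((features, labels_column), axis=1)

# column_stack turns the 1-D labels into a column before joining
combined_alt = np.column_stack((features, labels))

print(combined.shape)
print(np.array_equal(combined, combined_alt))
```

Using -1 avoids hard-coding the number of rows, so the same line keeps working if the dataset changes size.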

We can now convert our dataset into a data frame. We will need to assign names to the columns of the data frame. We'll use sepal length, sepal width, petal length, petal width, target as our five column names.

In [None]:
iris_df = pd.DataFrame(all_iris_data,
                       columns = ['sepal length', 'sepal width',
                                  'petal length', 'petal width', 'target'])

In [None]:
# let's display the data frame we just created
iris_df
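One detail worth noticing in the displayed data frame: because the integer targets were joined with decimal measurements in a single numpy array, the target column shows up as 0.0, 1.0, 2.0 rather than 0, 1, 2. If whole numbers are preferred, one option is to convert that column, as in this sketch on a tiny made-up data frame (small_df and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# a tiny made-up array in the same layout: measurements, then target
small_array = np.array([[5.1, 3.5, 0.0],
                        [6.3, 3.3, 2.0]])

small_df = pd.DataFrame(small_array,
                        columns=['sepal length', 'sepal width', 'target'])

# every column starts out as a decimal (float) type
print(small_df.dtypes)

# convert only the target column to whole numbers
small_df['target'] = small_df['target'].astype(int)

print(small_df.dtypes)
```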

There are a lot of operations one can perform on a data frame. I'd recommend taking a look at this page (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which outlines many of them, to get an idea of what we can do with data frames. Maybe try out a few yourself using the data frame we just created. For example, with the help of this webpage, you should be able to answer the following questions.

In [None]:
# Q9: a) print the names of the columns of the data frame
# b) display rows 48 to 55 of the data frame
# c) display column 'petal width'
# d) create a new data frame called test_df that only contains
# the first three columns of iris_df
# e) display the first 7 rows of the data frame

You've reached the end of our first notebook. Please don't be afraid to reach out to your TAs or your more experienced classmates if anything is not clear: they'll be happy to help make sure that the whole team is familiar with these concepts before we move on.

Our next task will be to use the input columns of the iris data frame to predict the target value using a basic machine learning algorithm. If you're already familiar with these ideas, give it a try yourself before checking out our approach. As is very often the case in machine learning, many different approaches are possible.

In [None]:
# ANS1:
# np.sum(temperature_array)

In [None]:
# ANS2:
# temp_array_C = (temperature_array - 32) * 5 / 9

In [None]:
# ANS3:
# maximum3weeks = np.max(temp_3_week)
# minimum3weeks = np.min(temp_3_week)
# average3weeks = np.average(temp_3_week)
# sum3weeks = np.sum(temp_3_week)
# celsius3weeks = (temp_3_week - 32)*5/9

In [None]:
# ANS4:
# np.zeros((4, 3))

In [None]:
# ANS5:
# np.ones((2, 16))

In [None]:
# ANS6:
# np.random.rand(6, 2)

In [None]:
# ANS7:
# np.arange(2, 10, 2)

In [None]:
# ANS8:
# example_array[1][2][3]

# alternatively, since this is the last entry along every dimension,
# example_array[-1][-1][-1]

In [None]:
# ANS9:
# a)
# iris_df.columns
# b)
# iris_df.iloc[48:56]
# c)
# iris_df['petal width']
# d) several solutions are possible. One example:
# test_df = iris_df.drop(columns=['petal width', 'target'])
# e)
# iris_df.head(7)