# CSC-321: Data Mining and Machine Learning


## Assignment 0: Introduction to Colab

### Part 1: Software overview

In this class, all your assignments will be done in Google Colab. If you cannot use Colab for any reason, you can work locally instead, provided you install all the necessary libraries.

This whole document is a Jupyter notebook: https://jupyter.org/

Jupyter is a browser-based interactive environment for coding. In this case, we are using Python 3.

We're using (at least) the following libraries in Python. All of these are already installed on Colab:

- pandas (for data manipulation): https://pandas.pydata.org/
- matplotlib (for plotting): https://matplotlib.org/
- scikit-learn (Python ML library): https://scikit-learn.org/stable/
- plotly (for some graphing): https://plot.ly/python/
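As a quick sanity check (mostly useful for a local install), the sketch below reports which of these libraries Python can find. It assumes the usual import names, in particular `sklearn` for scikit-learn; the helper name `check_libraries` is made up for this example.

```python
import importlib.util

def check_libraries(names):
    """Map each import name to True if Python can find it, False otherwise."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Import names for the libraries listed above (scikit-learn imports as sklearn)
status = check_libraries(['pandas', 'matplotlib', 'sklearn', 'plotly'])
for name, installed in status.items():
    print(name, 'is installed' if installed else 'is MISSING')
```

On Colab every line should say "is installed"; on your own machine, anything reported as missing needs to be installed before you can run the assignments locally.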

It is possible to install all of this locally using Anaconda: https://www.anaconda.com/ 

Or you can install libraries that enable Jupyter notebooks in IDEs such as Visual Studio Code: https://code.visualstudio.com/docs/python/jupyter-support

Working in Google Colab means you do NOT need a local install, and you will not be limited by your own computing power (although there are other limitations). 




### Part 2: Jupyter overview

Notebooks are composed of cells. For example, this is the second cell in this notebook. There are two types of cells in Colab: code and text. You can add cells using the buttons at the top of the notebook.

Any cell marked with square brackets, sometimes containing a number, such as:

[1]

represents a cell of Python code. You can hit the run button next to the box, and it will execute the code, showing the output in the following cell.

Any cell marked

[->

represents the output of the previous cell, including graphs, tables and basic output.

Text cells are where you can write or edit information. The format of these cells is *markdown*. You can find basic markdown information here: https://www.markdownguide.org/basic-syntax/

For a list of keyboard shortcuts, use: Ctrl/⌘ + M + H

Below, I've added a code cell, and I've put some basic code in it. Run the cell below.

In [8]:
# First code cell
# Hello world

print('Hello World')
a = 27
print('A contains:',a)


Hello World
A contains: 27


I've added another text cell. Now I'm going to create a new code cell below.

In [9]:
def makeA():
  return 103

a = makeA()
print('A contains:',a)

A contains: 103


One of the slightly weird qualities of Colab (and Jupyter) is that you can run cells out of order. For instance, if you now run the cell below, it should print 103. But if you go back and run the first code cell (above, with the Hello World code) and THEN run the cell below, without re-running the makeA code, it will print 27 instead, because `a` holds whatever value was assigned to it most recently.


In [10]:
print('A contains',a)

A contains 103
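One way to picture this: every code cell reads and writes the same set of variables, so what matters is the order in which cells were run, not their position on the page. The sketch below imitates that with a plain dictionary standing in for the notebook's shared variables. It is a simplified model only; `run_cell`, `cells` and `shared_variables` are hypothetical names for illustration, not part of Jupyter or Colab.

```python
# A dictionary standing in for the variables shared by all cells
shared_variables = {}

# Two "cells", stored as source code (simplified versions of the cells above)
cells = {
    'hello': "a = 27",
    'makeA': "def makeA():\n    return 103\n\na = makeA()",
}

def run_cell(name):
    """Run one 'cell' against the shared variables and return the current value of a."""
    exec(cells[name], shared_variables)
    return shared_variables['a']

print(run_cell('hello'))   # a is 27
print(run_cell('makeA'))   # a is now 103
print(run_cell('hello'))   # re-running the earlier cell sets a back to 27
```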



To make sure you don't get into trouble, use the "Run before" or "Run all" options (in Colab's Runtime menu) so that your notebook stays coherent. In cells you can also import libraries, and do pretty much anything you can do in regular Python IDEs. For example:


In [11]:
import math

number = float(input('Enter a number:'))
print('The number is:', number)
print('The square root of your number is:',math.sqrt(number))

Enter a number:8546580
The number is: 8546580.0
The square root of your number is: 2923.4534372895355
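Note that the cell above stops with a `ValueError` if the text typed in is not a number, or if the number is negative (`math.sqrt` only accepts non-negative values). A minimal sketch of one way to guard against both cases follows; the function name `safe_sqrt` is made up for this example.

```python
import math

def safe_sqrt(text):
    """Return the square root of the number in text, or None if it cannot be computed."""
    try:
        return math.sqrt(float(text))
    except ValueError:
        # Raised by float() for non-numeric text and by math.sqrt() for negative numbers
        return None

print(safe_sqrt('16'))        # 4.0
print(safe_sqrt('sausage'))   # None
print(safe_sqrt('-4'))        # None
```

In a notebook you could pass `input('Enter a number:')` to this function and print a friendly message whenever it returns `None`.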


Markdown in text cells can be used to render things like LaTeX formulas, such as the formula for Euclidean distance seen below:

$$\text{distance} = \sqrt{\sum_{i=1}^{n} (x_{1,i} - x_{2,i})^2}$$
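To connect the formula to code, here is a small sketch that computes the same quantity for two points given as lists of coordinates. The function name `euclidean_distance` is chosen for this example; Python 3.8 and later also provide `math.dist`, which does the same job.

```python
import math

def euclidean_distance(x1, x2):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x1) != len(x2):
        raise ValueError('x1 and x2 must have the same length')
    # Sum the squared differences coordinate by coordinate, then take the root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
```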


Markdown can also render tables:

Name | Date | Number System | Mechanism | Programming | Turing Complete? 
 --- | --- | --- | --- | --- | --- 
 Zuse Z3 | 1941 | Binary | Electro-Mechanical | 35mm film | Yes
 Atanasoff-Berry Computer (ABC) | 1942 | Binary | Electronic | None | No
 Harvard Mark 1 / IBM ASCC | 1944 | Decimal | Electro-Mechanical | Punched paper tape | No
 Colossus | 1944 | Binary | Electronic | Patch cables and switches | No
 ENIAC | 1946 | Decimal | Electronic | Patch cables and switches | Yes
 Manchester Small-Scale Experimental Machine (BABY) | 1948 | Binary | Electronic | Stored Program | Yes
 EDSAC | 1949 | Binary | Electronic | Stored Program | Yes

Jupyter can also be used to render HTML in code cells. An example is given below.

In [12]:
from IPython.display import HTML, display

# Create some data

attribs = ['attribute','one','two','three','four']
coeffs = ['value',27, "sausage", 32.78, -0.03]
data = list(zip(attribs,coeffs))

# Create HTML to make a table
# If you replace: 
#
# </td><td style="background-color: cyan">
#
# with:
#
# </td><td>
#
# then the color goes away

display(HTML(
   '<table><tr>{}</tr></table>'.format(
       '</tr><tr>'.join(
           '<td>{}</td>'.format('</td><td style="background-color: cyan">'.join(str(_) for _ in row)) for row in data)
       )
))
print('\nA table using HTML')

attribute | value
 --- | ---
 one | 27
 two | sausage
 three | 32.78
 four | -0.03



A table using HTML
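The nested `format` and `join` calls above are compact but hard to read. As a sketch of the same idea written step by step, the helper below builds the HTML string one cell at a time. The name `rows_to_html_table` and its `highlight` parameter are made up for this example, and `html.escape` is added so that values containing characters like `<` display as text.

```python
from html import escape

def rows_to_html_table(rows, highlight='cyan'):
    """Build an HTML table string; every cell after the first in a row is highlighted."""
    html_rows = []
    for row in rows:
        cells = []
        for index, value in enumerate(row):
            # Only style cells after the first, and only if a color was given
            style = f' style="background-color: {highlight}"' if index > 0 and highlight else ''
            cells.append(f'<td{style}>{escape(str(value))}</td>')
        html_rows.append('<tr>' + ''.join(cells) + '</tr>')
    return '<table>' + ''.join(html_rows) + '</table>'

attribs = ['attribute', 'one', 'two', 'three', 'four']
coeffs = ['value', 27, 'sausage', 32.78, -0.03]
table_html = rows_to_html_table(zip(attribs, coeffs))
print(table_html)
```

In a notebook, passing `table_html` to `display(HTML(...))` as in the cell above should render the table; calling the helper with `highlight=None` leaves the color out.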


### Part 3: Accessing data

For the last part of this notebook, I'm going to introduce pandas. Pandas is a very useful library for loading and manipulating data. We'll look at some useful functions of pandas as we go through the class, but for now, I want to demonstrate loading data.

There are two code cells below. The first loads a CSV (comma-separated values) file directly from a URL. In this case, it's a CSV file that is hosted on my GitHub page.

I load this data into a pandas data structure called a dataframe, and then use the head() method on that dataframe to display the first five rows.

In [13]:
import pandas as pd

# Assign column names
labels = ['att0','att1','att2','att3','att4','att5','att6','att7','att8','att9','att10','att11']

# Load the data
wine_data = pd.read_csv("https://raw.githubusercontent.com/TJSchlueter/Class_Data/main/winequality-white.csv",names=labels)

# Show the head of the data
wine_data.head()

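| | att0 | att1 | att2 | att3 | att4 | att5 | att6 | att7 | att8 | att9 | att10 | att11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |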
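If you want to experiment with `read_csv` without downloading anything, the sketch below reads a few rows from a string instead of a URL. The three rows are copied from the output above and stand in for the full file; `io.StringIO` simply makes the text behave like a file.

```python
import io
import pandas as pd

# Same column names as above, generated instead of typed out
labels = [f'att{i}' for i in range(12)]

# Three rows copied from the head() output above, standing in for the full file
sample_csv = io.StringIO(
    "7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6\n"
    "6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6\n"
    "8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6\n"
)

# header=None says the data has no header row; names supplies the column names
sample = pd.read_csv(sample_csv, header=None, names=labels)

print(sample.shape)      # (number of rows, number of columns)
print(sample.head(2))    # head() accepts an optional row count
```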


Now I do the same thing below, but this time the wine data set CSV is stored in my Google Drive account.

You should have a Colab Notebooks directory, created automatically in your Drive.

Inside that directory in **my** Google Drive, I created a Data folder. You should probably also create a CSC321 folder to store your notebooks for this class.

You'll need to download the wine data set from Nexus, and upload it to your folder on Google Drive.

When you execute the code below, you'll have to authorize Colab to access your Google account. You'll need to repeat this periodically, whenever the notebook disconnects.


In [None]:
# Authorize Drive
# This has to be done periodically: whenever your notebook disconnects from the kernel
# pandas is imported again here so this cell also works after a reconnect
import pandas as pd
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Accessing data from Google Drive

file = '/content/drive/My Drive/Colab Notebooks/CSC321/assignment0/winequality-white.csv'

# Assigning column names
labels = ['att0','att1','att2','att3','att4','att5','att6','att7','att8','att9','att10','att11']

# Using pandas to read the data
df = pd.read_csv(file,header=None,names=labels)
df.head()
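If the path does not match your own folder layout, `read_csv` fails with a `FileNotFoundError`. As a hedged sketch of one way to troubleshoot, the helper below checks the path first and lists what is actually in the folder. The name `find_data_file` is made up for this example, and the demonstration uses a temporary folder rather than a real Drive path so it can run anywhere.

```python
import tempfile
from pathlib import Path

def find_data_file(path):
    """Return path as a Path if the file exists; otherwise raise an error with a hint."""
    candidate = Path(path)
    if candidate.is_file():
        return candidate
    parent = candidate.parent
    if parent.is_dir():
        found = sorted(p.name for p in parent.iterdir())
        hint = f"Folder {parent} contains: {found}"
    else:
        hint = f"Folder {parent} does not exist (is Drive mounted?)"
    raise FileNotFoundError(f"{candidate} not found. {hint}")

# Demonstration in a temporary folder standing in for a Drive folder
with tempfile.TemporaryDirectory() as folder:
    existing = Path(folder) / 'winequality-white.csv'
    existing.write_text('7.0,0.27\n')
    print(find_data_file(existing).name)
    try:
        find_data_file(Path(folder) / 'missing.csv')
    except FileNotFoundError as error:
        print(error)
```

In the cell above, you could wrap the path as `file = find_data_file(file)` before calling `pd.read_csv`.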