# **The DS-100 Guide**


## Table of Contents
- [The Shared Computing Cluster (SCC)](#scc)
- [Vanilla Python](#python)
- [Packages](#packages)
    - [NumPy](#numpy)
    - [datascience](#datascience)
- [Understanding Documentation](#documentation)
- [GitHub](#github)

<sup>The table of contents is only usable in Jupyter Notebooks or on [nbviewer](https://nbviewer.org/github/langdon/ds-100/blob/cethan-ec_file_jupyter-draft/DS-100_Technical_Guide.ipynb). If you are viewing this on GitHub, you will need to simply scroll.</sup>

---

## **The Shared Computing Cluster (SCC)** <a id='scc'></a>

The SCC is where all DS-100 students will run the files required for their assignments. It is a heterogeneous Linux cluster with over 9,000 shared CPU cores, 250 GPU cores, and 1 petabyte of storage. The SCC is mainly used for high-performance computing tasks in disciplines such as engineering, biostatistics, and machine learning.


### **Accessing the SCC**

1. Ensure that you are not on Boston University’s “BU (802.1x)” wifi network
2. Go to [the SCC site](https://scc-ondemand.bu.edu)
3. Log in to the SCC with your Kerberos username and password (the account you use for the Student Link)
4. Optional: Save/bookmark this page as you will be accessing it often


---

## **Python** <a id='python'></a>


### **Data Types**

#### Int (integer) (Lec #4)

An int in Python is a whole number of any size. It never has a decimal point.

In [1]:
x_int = 1234

#### Float (number with a decimal point) (Lec #4)

Floats always have a decimal point, though the fractional part may be zero (for example, 12.0). They have a limited size and a precision of about 15-16 significant digits. Floats **can be wrong** in the final few decimal places after arithmetic (due to how they are stored in memory, see __[here](https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate)__).

In [2]:
x_float = 12.34
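
The rounding issue described above can be seen with a small sum. This is a minimal sketch using only the standard library. The variable name `total` is made up for illustration, and `math.isclose` is one common way to compare floats, not the only one.

```python
import math

# Neither 0.1 nor 0.2 can be stored exactly in binary, so the sum is slightly off
total = 0.1 + 0.2

print(total)          # 0.30000000000000004
print(total == 0.3)   # False: the final digits differ

# Compare floats with a tolerance instead of ==
print(math.isclose(total, 0.3))   # True
```

When a result only needs a few decimal places, rounding it for display (for example with `round(total, 2)`) is usually enough.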

#### Str (Strings & text) (Lec #4)

Strings are a sequence of characters of any length. A string that contains only digits can be converted into an int with int(), and a string that contains only digits and at most one decimal point can be converted into a float with float(). While all values can be converted into a string, a string can't always be converted back into its previous data type.

In [3]:
x_str = '1234'
print(x_str, type(x_str))

x_str_to_int = int(x_str)
print(x_str_to_int, type(x_str_to_int))

x_str_revert = str(x_str_to_int)
print(x_str_revert, type(x_str_revert))

1234 <class 'str'>
1234 <class 'int'>
1234 <class 'str'>
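
The example above uses a string of digits only. The sketch below shows what happens when the string contains a decimal point. The names `decimal_str`, `as_float`, `int_failed`, and `as_int` are made up for illustration.

```python
decimal_str = '12.34'

# float() accepts a string that contains a decimal point
as_float = float(decimal_str)
print(as_float, type(as_float))

# int() rejects the same string and raises a ValueError
try:
    int(decimal_str)
    int_failed = False
except ValueError:
    int_failed = True
print('int() failed:', int_failed)

# One workaround: convert to float first, then to int (this drops the .34)
as_int = int(float(decimal_str))
print(as_int)
```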


#### Arrays/Lists 

Arrays (or lists) contain a sequence of values.

- All elements of an array should have the same type
    - In Python, a list can hold values of different types
    - For arrays from NumPy or the datascience package, each element must be of the same type
- Arithmetic on an array is applied to each element individually
- Adding arrays adds their respective elements together (if they're the same length)
    - Using '+' on Python lists will *not* add their elements together; it will instead *concatenate* the lists and create a new list
- A column of a table from the datascience package is an array

In [4]:
import numpy as np
np_array1 = np.array([1, 2, 3, 4, 5])
np_array2 = np.array([5, 4, 3, 2, 1])

# adding two numpy arrays together will add each element in the array 
# to the corresponding element in the other array
print('NumPy array:\n', np_array1 + np_array2)


print('------------------')


py_list1 = [1, 2, 3, 4, 5]
py_list2 = [5, 4, 3, 2, 1]

# adding two python lists together will concatenate the lists, NOT add the elements
print('Python list:\n', py_list1 + py_list2)


NumPy array:
 [6 6 6 6 6]
------------------
Python list:
 [1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
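
The same difference shows up when an array or a list is combined with a single number. This is a small sketch, assuming NumPy is installed; the variable names are made up for illustration.

```python
import numpy as np

scores = np.array([1, 2, 3, 4, 5])

# Arithmetic with a single number is applied to every element of a NumPy array
doubled_array = scores * 2
shifted_array = scores + 10
print(doubled_array)
print(shifted_array)

py_scores = [1, 2, 3, 4, 5]

# Multiplying a Python list by 2 repeats the list instead of doubling each element
repeated_list = py_scores * 2
print(repeated_list)
```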



### **Assignment Statements** (Lec #3)

Assignment statements, denoted by a *single* =, bind a variable name to a value. Variable names are *case-sensitive* (for example, 'A' and 'a' are two different variables).

In [5]:
hw_due_date = 'On Thursday'
print(type(hw_due_date))

# The value and type can be overwritten

hw_due_date = ['On Thursday']
print(type(hw_due_date))

a = 0
A = 1
print(a == A)

<class 'str'>
<class 'list'>
False



### **Using Functions** (Lec #3)

A function is a block of code that will run whenever it is called. The variables listed inside the parentheses of a function's definition are parameters, while the actual values that are sent to the function when it is called are arguments. Additionally, some functions can take *multiple* arguments (e.g. np.array(), max(), min(), etc).

**Some** functions have optional parameters. These parameters do not need to be specified in order for the function to work. For example, in np.array() you can specify the type of data all elements should be, but NumPy can do that automatically if you choose to *not* specify.
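
To make the parameter/argument distinction concrete, here is a minimal sketch of a function with one required and one optional parameter. The function `describe` and its parameter names are hypothetical and are not part of Python or any course package.

```python
def describe(value, label='value'):
    """Return a short description of a value and its type.

    `value` is a required parameter; `label` is optional and defaults to 'value'.
    """
    return f"{label}: {value} ({type(value).__name__})"

# Only the required argument is passed, so the default label is used
print(describe(1234))                      # value: 1234 (int)

# The optional parameter is given explicitly by name
print(describe(12.34, label='x_float'))    # x_float: 12.34 (float)
```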

In [6]:
# From Lec #4
# type() is a function, while x_int and x_float are arguments in this case
# print() is also a function, and it can take more than one argument

print(type(x_int), type(x_float))

<class 'int'> <class 'float'>


It is also possible to chain function calls together, so that each call works on the result of the previous one. In the example below, lower() makes all the letters lower case, and then replace() replaces *all* instances of 'han' with 'c'.


In [7]:
name = 'Ethan Chang'
weird_name = name.lower().replace('han', 'c')
print(name)
print(weird_name)

Ethan Chang
etc ccg



### **Conversions** (Lec #4)

In Python, a value can usually be converted to another type, but sometimes you will be unable to convert it back to its original format or value.

Use functions to change the type of a value:
- int()
- float()
- str()
- list()
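
As a small illustration of a conversion that cannot be undone, the sketch below converts a float to an int and back. The variable names are made up for illustration.

```python
price = 3.99

# int() drops everything after the decimal point (it does not round)
price_as_int = int(price)
print(price_as_int)        # 3

# Converting back gives 3.0, not the original 3.99
price_back = float(price_as_int)
print(price_back)          # 3.0

# Negative values are also cut off toward zero
print(int(-3.99))          # -3
```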

In [8]:
conv_list = [1, 2, 3, 4, 5]
print(type(conv_list), conv_list)

# changing the list to a str
conv_list_str = str(conv_list)
print(type(conv_list_str), conv_list_str)

# attempting to change the str back to its original list form
# note how it does not work
conv_list_revert = list(conv_list_str)
print(type(conv_list_revert), conv_list_revert)

<class 'list'> [1, 2, 3, 4, 5]
<class 'str'> [1, 2, 3, 4, 5]
<class 'list'> ['[', '1', ',', ' ', '2', ',', ' ', '3', ',', ' ', '4', ',', ' ', '5', ']']


---

## **Importing packages** <a id='packages'></a>

To import a package, use the *import* keyword.

From
- Using *from* lets you specify which functions to import from a package or file; use * to import everything
- When using *from*, you can use the imported functions as if they were defined within the current file
    - For example, if you simply import datascience, you will need to write 'datascience.' before every function from the package
        - datascience.Table() instead of just Table()

As
- Using *as* will let you rename a package or function
    - For example, you can import numpy as np, which will let you use np.array() instead of numpy.array()
    - This effectively shortens the number of characters you need to type
    - Some packages have a conventional short name (numpy is commonly imported as np), so it is good practice to always import numpy as np

In [9]:
import numpy as np
from datascience import *
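
The three import styles can be compared side by side. This sketch uses Python's built-in `math` module only because it is always available; the same pattern applies to numpy and datascience.

```python
# Plain import: every function needs the package name in front of it
import math
print(math.sqrt(16))

# Import with 'as': the package gets a shorter name
import math as m
print(m.sqrt(16))

# Import with 'from': the function can be used without any prefix
from math import sqrt
print(sqrt(16))
```

All three calls run the same function and print 4.0; only the name used to reach it changes.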


## **Package: NumPy** <a id='numpy'></a>

### **NumPy Functions**

#### [np.array](https://numpy.org/doc/stable/reference/generated/numpy.array.html)
np.array('array_like object', dtype=None)

Arrays from NumPy are similar to lists in Python, but they are not as flexible in the data types they can hold at once. All elements within a NumPy array must be the same data type. NumPy will try to pick a data type that all values within the array can fit into (*without* removing any parts of an element).
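
Because every element shares one data type, an existing array can also be converted as a whole. The sketch below uses the array method `astype`, assuming NumPy is installed; the variable names are made up for illustration.

```python
import numpy as np

int_array = np.array([1, 2, 3])

# astype returns a new array with every element converted to the requested type
float_array = int_array.astype(float)
print(float_array.dtype, float_array)

# Arithmetic that mixes ints and floats produces a float array
mixed_result = int_array + 0.5
print(mixed_result.dtype, mixed_result)

# Strings that look like numbers can be converted in one step
str_array = np.array(['1.5', '2.5'])
numbers = str_array.astype(float)
print(numbers.dtype, numbers)
```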

In [10]:
# The string '2' is not a number, and str is the only type that can hold
# every element unchanged, so NumPy converts all the elements to str
array_mixed1 = np.array([1, '2', 3, 4.0, 5])
print(array_mixed1.dtype, array_mixed1)

# Numpy tries to keep the information stored in the '4.0' float 
# so it converts all the other elements to float
array_mixed2 = np.array([1, 2, 3, 4.0, 5])
print(array_mixed2.dtype, array_mixed2)

# You can however force NumPy to convert all the elements to a specific type
array_mixed3 = np.array([1, 2, 3, 4, 5], dtype=str)
print(array_mixed3.dtype, array_mixed3)

<U32 ['1' '2' '3' '4.0' '5']
float64 [ 1.  2.  3.  4.  5.]
<U1 ['1' '2' '3' '4' '5']


#### [np.arange](https://numpy.org/doc/stable/reference/generated/numpy.arange.html?)

np.arange(start, stop, step)

np.arange creates an array that contains a sequence of numbers. The first argument is the starting number, the second argument is the stopping number, and the third argument is the step size. The stopping number is not included in the array.

In [11]:
# A normal array counting by 2s
np_arange = np.arange(1, 10, 2)
print(np_arange)

# A normal array counting by 2s, but backwards
np_arange = np.arange(9, 0, -2)
print(np_arange)

[1 3 5 7 9]
[9 7 5 3 1]
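
A few more calls help show how the stopping number is excluded. This is a short sketch, assuming NumPy is installed; it also uses the one-argument form of `np.arange`, which starts at 0 and counts by 1.

```python
import numpy as np

# With a single argument, np.arange starts at 0 and counts by 1
first_five = np.arange(5)
print(first_five)        # [0 1 2 3 4]

# The stop value 10 is never included, even when the step lands on it exactly
by_fives = np.arange(0, 10, 5)
print(by_fives)          # [0 5]

# The step can be a float
quarters = np.arange(0, 1, 0.25)
print(quarters)
```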



## **Package: datascience** <a id='datascience'></a>

### **Creating a table**


In [12]:
ca_table = Table().with_columns(
    'Label', ['Row 1', 'Row 2', 'Row 3', 'Row 4'],
    'Course Assistants', ['Ethan Chang', '?', '?', '?'],
    'Major', ['Data Science', 'N/A', 'N/A', 'N/A']
)

ca_table

Label,Course Assistants,Major
Row 1,Ethan Chang,Data Science
Row 2,?,N/A
Row 3,?,N/A
Row 4,?,N/A


### **Table Structure** (Lec #3)

If you run the code cell below, you will see a table with 4 rows and 3 columns. The name of a column (also known as a label) will always be at the top, and the entries for the table will be below it. 

In [13]:
ca_table

Label,Course Assistants,Major
Row 1,Ethan Chang,Data Science
Row 2,?,N/A
Row 3,?,N/A
Row 4,?,N/A


### **Table Operations**

#### t.select
t.select(*column_or_columns*)

The select function from the datascience package allows you to select specific columns in a table. You can specify either one column or multiple using the column names (in string format) or the index of the column (with an int).

'.select()' can *only* be used on a table, and it returns a *new* table with the specified columns.

In [14]:
# single column selected
ca_select = ca_table.select('Course Assistants')
ca_select.show()

# two columns selected, the first instance of ca_select is being overwritten
ca_select = ca_table.select('Label', 'Major')
ca_select.show()

# selecting columns from the previous table (you should not need to do this)
ca_select = ca_select.select('Label')
ca_select.show()

Course Assistants
Ethan Chang
?
?
?


Label,Major
Row 1,Data Science
Row 2,N/A
Row 3,N/A
Row 4,N/A


Label
Row 1
Row 2
Row 3
Row 4


#### t.drop
t.drop(*column_or_columns*)

As the name suggests, 'drop' will *remove* or drop the column(s). You can use either the column name or index to specify which columns you want to remove. Additionally, 'drop' will return a *new* table.

*Instead of selecting multiple columns just to exclude one, simply drop the column instead.*

In [15]:
ca_copy = ca_table.copy()
ca_copy.show()

ca_drop = ca_copy.drop('Label')
ca_drop.show()

# try to not do this, it gets difficult if your columns have long names
ca_copy.select('Course Assistants', 'Major').show()

Label,Course Assistants,Major
Row 1,Ethan Chang,Data Science
Row 2,?,N/A
Row 3,?,N/A
Row 4,?,N/A


Course Assistants,Major
Ethan Chang,Data Science
?,N/A
?,N/A
?,N/A


Course Assistants,Major
Ethan Chang,Data Science
?,N/A
?,N/A
?,N/A
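
To see why 'select' and 'drop' are two ways of reaching the same result, the sketch below mimics them with a plain Python dictionary of columns. This is only an illustration of the idea and is not how the datascience package implements tables; `select_columns` and `drop_columns` are hypothetical helper names.

```python
# A tiny "table": each key is a column label, each value is that column's entries
columns = {
    'Label': ['Row 1', 'Row 2'],
    'Course Assistants': ['Ethan Chang', '?'],
    'Major': ['Data Science', 'N/A'],
}

def select_columns(table, *labels):
    """Return a new dict that contains only the listed columns."""
    return {label: table[label] for label in labels}

def drop_columns(table, *labels):
    """Return a new dict that contains every column except the listed ones."""
    return {label: values for label, values in table.items() if label not in labels}

selected = select_columns(columns, 'Course Assistants', 'Major')
dropped = drop_columns(columns, 'Label')

# Both approaches give the same result, and the original is left unchanged
print(selected == dropped)
print(list(columns))
```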


#### t.sort
t.sort(*column_or_label, descending=False, distinct=False*)

'sort' returns a *new* table with the rows sorted according to the values in a column. Unless specified, 'sort' sorts in ascending order and keeps rows with repeated values. Setting 'descending' to True sorts in descending order, and setting 'distinct' to True keeps only one row for each distinct value in that column.

In [16]:
ca_sort = ca_table.copy()

# sorting in descending order, see how the Row # starts with 4 then goes to 1
ca_sort.sort('Label', descending=True).show()

# sorting while removing non-unique values, all but one row with '?' is gone
ca_sort.sort('Course Assistants', distinct=True).show()

Label,Course Assistants,Major
Row 4,?,N/A
Row 3,?,N/A
Row 2,?,N/A
Row 1,Ethan Chang,Data Science


Label,Course Assistants,Major
Row 2,?,N/A
Row 1,Ethan Chang,Data Science
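
The effect of 'descending' and 'distinct' can be imitated with plain Python to make the output above easier to follow. This is a rough sketch of the behavior shown in the output, not the datascience implementation; the rows are stored as hypothetical (label, name) tuples.

```python
rows = [
    ('Row 1', 'Ethan Chang'),
    ('Row 2', '?'),
    ('Row 3', '?'),
    ('Row 4', '?'),
]

# Like descending=True: sort by the label, largest first
by_label_desc = sorted(rows, key=lambda row: row[0], reverse=True)

# Like distinct=True: sort ascending by name, then keep the first row for each name
by_name = sorted(rows, key=lambda row: row[1])
seen = set()
distinct_rows = []
for row in by_name:
    if row[1] not in seen:
        seen.add(row[1])
        distinct_rows.append(row)

print(by_label_desc)
print(distinct_rows)
```

The '?' rows come before 'Ethan Chang' because '?' sorts before letters, which matches the order in the second table above.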


#### t.where (Lec #3)

t.where(*column_or_label, value_or_predicate=None, other=None*)

'where' takes in up to three arguments: 
1. The column or label of the column you want to filter by, this can be a string or int (for the index of the column)
2. Value or predicate that you want to filter by
    - If you want to filter by a specific value, you can use a string, int, or float
    - If you want to filter by a predicate, you can use a function that returns a boolean value
3. An optional second column label, used when the predicate should compare the values of two columns row by row

There are a variety of predicates that can be used, [here](http://data8.org/datascience/predicates.html?highlight=#datascience.predicates.are) is a list of all of them with examples.

In [17]:
ca_where = ca_table.with_columns(
    'Extra', ['?'] * 4
)
print('Base Table')
ca_where.show()


# shows rows in the table where the value in the 'Course Assistants' column is 'Ethan Chang'
# where(column, value) is what is being used
print('\n', 'Table #1')
ca_where.where('Course Assistants', 'Ethan Chang').show()


# shows rows in the table where values in 'Course Assistants' are equal to 'Extra'
# where(column, predicate, column) is being used, are.equal_to is a predicate
print('\n', 'Table #2')
ca_where.where('Course Assistants', are.equal_to, 'Extra').show()


# shows rows in the table where values in 'Extra' are equal to '?' (which is all values)
# where(column, predicate) is being used here
# this is similar to the first example with 'Ethan Chang', but this one uses are.equal_to()
print('\n', 'Table #3')
ca_where.where('Extra', are.equal_to('?')).show()

Base Table


Label,Course Assistants,Major,Extra
Row 1,Ethan Chang,Data Science,?
Row 2,?,N/A,?
Row 3,?,N/A,?
Row 4,?,N/A,?



 Table #1


Label,Course Assistants,Major,Extra
Row 1,Ethan Chang,Data Science,?



 Table #2


Label,Course Assistants,Major,Extra
Row 2,?,N/A,?
Row 3,?,N/A,?
Row 4,?,N/A,?



 Table #3


Label,Course Assistants,Major,Extra
Row 1,Ethan Chang,Data Science,?
Row 2,?,N/A,?
Row 3,?,N/A,?
Row 4,?,N/A,?
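
A predicate is simply a function that returns True or False for a value. The sketch below imitates 'where' with plain Python to show the idea. It is not the datascience implementation, and the `equal_to` and `where` functions defined here are hypothetical stand-ins written for this example.

```python
rows = [
    {'Label': 'Row 1', 'Course Assistants': 'Ethan Chang', 'Extra': '?'},
    {'Label': 'Row 2', 'Course Assistants': '?', 'Extra': '?'},
]

def equal_to(target):
    """Build a predicate: a function that returns True or False for one value."""
    def predicate(value):
        return value == target
    return predicate

def where(rows, column, predicate):
    """Keep the rows whose value in `column` makes the predicate return True."""
    return [row for row in rows if predicate(row[column])]

matches = where(rows, 'Course Assistants', equal_to('Ethan Chang'))
print([row['Label'] for row in matches])   # ['Row 1']

all_rows = where(rows, 'Extra', equal_to('?'))
print([row['Label'] for row in all_rows])  # ['Row 1', 'Row 2']
```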


---

## **Understanding Documentation** <a id='documentation'></a>

Links to documentation for Python and the packages used or referenced in this course:
- [Python](https://docs.python.org/)
- [datascience](http://data8.org/datascience/)
- [NumPy](https://numpy.org/doc/stable/reference/)

Documentation's official definition is "material that provides official information or evidence or that serves as a record". For programming, documentation is a way for the creators of languages and packages to provide information about their code. Additionally, when writing your own programs, you can add documentation to help others understand your code.
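
One common way to document your own Python code is a docstring: a string placed on the first line inside a function. The sketch below is a minimal example, and the function `average` is made up for illustration.

```python
def average(values):
    """Return the arithmetic mean of a non-empty sequence of numbers.

    values: a list (or other sequence) of ints or floats.
    """
    return sum(values) / len(values)

# The docstring is stored on the function and can be read back
print(average.__doc__)

print(average([1, 2, 3]))   # 2.0
```

In a notebook, running `help(average)` displays the same text along with the function's name and parameters.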

The image below is a screenshot of the documentation for the 'Table.where()' function from the datascience package. It shows the function's name, parameters, and a description of what the function does. It also lists the constraints on the parameters and the possible return values.

<img src="images/ds_where.png"><br>
<sup>Image taken from [the datascience documentation webpage](http://data8.org/datascience/_autosummary/datascience.tables.Table.where.html?)</sup><br>


For some packages, the documentation will also list examples of the function being used. Reading these examples is a great way to learn how to call a function and to see what it returns. Below are the examples for the 'Table.where()' function.

<img src="images/ds_where-examples.png"><br>
<sup>Image taken from [the datascience documentation webpage](http://data8.org/datascience/_autosummary/datascience.tables.Table.where.html?)</sup>


---

## **GitHub** <a id='github'></a>

GitHub is a website where people can store the code they wrote for specific projects. It also allows developers to collaborate and to identify bugs/issues. This guide is on GitHub, and if you find a problem with the guide or have a suggestion, you can create a new issue on GitHub to let the contributors know.

If you are considering going into a field with software development intertwined into it in any form, understanding how to use Git and GitHub is essential.