# **The DS-100 Guide**


## Table of Contents
- [The Shared Computing Cluster (SCC)](#scc)
- [Vanilla Python](#python)
- [Packages](#packages)
    - [NumPy](#numpy)
    - [datascience](#datascience)
- [GitHub](#github)

---
## **The Shared Computing Cluster (SCC)** <a id='scc'></a>

All DS-100 students will use the SCC to run the files required for their assignments. It is a Linux cluster made up of many components, including over 9,000 shared CPU cores, 250 GPU cores, and 1 petabyte of storage. The SCC is mainly used for high-performance computing tasks in disciplines such as engineering, biostatistics, and machine learning.


### **Accessing the SCC**

1. Ensure that you are not on Boston University’s “BU (802.1x)” wifi network
2. Go to [the SCC site](https://scc-ondemand.bu.edu)
3. Log in to the SCC with your Kerberos username and password (the account you use for the Student Link)
4. Optional: Save/bookmark this page as you will be accessing it often


---
## **Python** <a id='python'></a>


### **Data Types**

#### Int (integer) (Lec #4)

An int in Python can be an integer of any size, and it never has a decimal point.

In [13]:
x_int = 1234
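
As a quick illustration of the "any size" point, the sketch below computes a value far larger than a 64-bit integer could hold. The name `big_int` is made up for this example and is not from lecture.

```python
# big_int is a hypothetical name used only for this illustration
big_int = 2 ** 100
print(big_int)        # 1267650600228229401496703205376
print(type(big_int))  # <class 'int'>

# Dividing with / always produces a float, even when the result is whole
print(type(4 / 2))    # <class 'float'>

# Floor division with // on two ints keeps the result an int
print(type(4 // 2))   # <class 'int'>
```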

#### Float (number w/ decimal point) (Lec #4)

Floats always have a decimal point, but the fractional part is optional (for example, `3.0`). They also have a limited size and a precision of about 15-16 significant digits. Floats **can be wrong** in the final few digits after arithmetic (due to how they are stored in memory, see __[here](https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate)__).

In [14]:
x_float = 12.34
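
The rounding issue described above can be seen with a very small calculation. This is a minimal sketch; the name `total` is chosen just for this example, and `math.isclose` is one common way to compare floats with a tolerance.

```python
import math

# total is a hypothetical name used only for this illustration
total = 0.1 + 0.2
print(total)         # 0.30000000000000004
print(total == 0.3)  # False, because of the error in the last digit

# Compare floats with a tolerance instead of ==
print(math.isclose(total, 0.3))  # True

# A float with no fractional part still shows a decimal point
print(float(3))      # 3.0
```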

#### Str (Strings & text) (Lec #4)

#### Arrays/Lists 


### **Assignment Statements** (Lec #3)

Assignment statements, denoted by a *single* =, are a type of statement that binds a variable to a value. Additionally, variable names are *case-sensitive* (e.g. 'A' and 'a' are two different variables).

In [15]:
hw_due_date = 'On Thursday'
print(type(hw_due_date))

# The value and type can be overwritten

hw_due_date = ['On Thursday']
print(type(hw_due_date))

a = 0
A = 1
print(a == A)

<class 'str'>
<class 'list'>
False
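
Because a single = assigns while a double == compares, the two are easy to mix up. The short sketch below contrasts them; the variable name `score` is invented for this example.

```python
# score is a hypothetical name used only for this illustration
score = 90           # assignment: binds the name score to the value 90
print(score == 90)   # comparison: True

# The right-hand side is evaluated first, then score is rebound to the result
score = score + 5
print(score)         # 95
print(score == 90)   # False
```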



### **Using Functions** (Lec #3)

A function is a block of code that runs whenever it is called. The variables listed inside the parentheses of a function's definition are parameters, while the actual values that are sent to the function when it is called are arguments. Additionally, some functions can take *multiple* arguments (e.g. np.array(), max(), min(), etc).
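
To make the parameter/argument distinction concrete, here is a minimal sketch with a made-up function, `rectangle_area`, which is not part of the course material.

```python
# rectangle_area is a hypothetical function used only for this illustration
def rectangle_area(width, height):  # width and height are parameters
    """Return the area of a rectangle."""
    return width * height

area = rectangle_area(3, 4)  # 3 and 4 are arguments
print(area)                  # 12

# Built-in functions such as max() can also take several arguments
print(max(2, 7, 5))          # 7
```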

In [16]:
# From Lec #4
# type() is a function, while x_int and x_float are arguments in this case
# print() is also a function, and it can take more than one argument

print(type(x_int), type(x_float))

<class 'int'> <class 'float'>


It is also possible to daisy-chain function calls together. In the example below, lower() converts all the letters to lower case, and replace() then replaces *all* instances of 'han' with 'c' in the lower-cased result.


In [17]:
name = 'Ethan Chang'
weird_name = name.lower().replace('han', 'c')
print(name)
print(weird_name)

Ethan Chang
etc ccg
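
One way to read a chain is to split it into separate steps, since each call runs on the result of the one before it. The sketch below does this with the same string; the names `lowered` and `replaced` are invented for this example.

```python
name = 'Ethan Chang'

# Step 1: lower() returns a new, lower-cased string
lowered = name.lower()
print(lowered)   # ethan chang

# Step 2: replace() runs on the result of step 1
replaced = lowered.replace('han', 'c')
print(replaced)  # etc ccg

# The chained version gives the same result
print(replaced == name.lower().replace('han', 'c'))  # True

# An optional third argument limits how many instances are replaced
print(lowered.replace('han', 'c', 1))  # etc chang
```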



### **Conversions** (Lec #4)

---
## **Importing packages** <a id='packages'></a>

In [18]:
import numpy as np
from datascience import *


## **Package: NumPy** <a id='numpy'></a>

### **NumPy Functions**

#### np.array

#### np.arange

#### np.diff


## **Package: datascience** <a id='datascience'></a>

### **Creating a table**


In [19]:
ca_table = Table().with_columns(
    'Label', ['Row 1', 'Row 2', 'Row 3', 'Row 4'],
    'Course Assistants', ['Ethan Chang', '?', '?', '?'],
    'Major', ['Data Science', 'N/A', 'N/A', 'N/A']
)

ca_table

Label,Course Assistants,Major
Row 1,Ethan Chang,Data Science
Row 2,?,N/A
Row 3,?,N/A
Row 4,?,N/A


### **Table Structure** (Lec #3)

If you run the code cell below, you will see a table with 4 rows and 3 columns. The name of a column (also known as a label) will always be at the top, and the entries for the table will be below it. 

In [20]:
ca_table

Label,Course Assistants,Major
Row 1,Ethan Chang,Data Science
Row 2,?,N/A
Row 3,?,N/A
Row 4,?,N/A


### **Table Operations**

#### t.select(*column_or_columns*) (Lec #3)
The select function from the datascience package allows you to select specific columns in a table. You can specify either one column or multiple columns, using the column names (in string format) or the index of the column (as an int).

'.select()' can *only* be used on a table, and it returns a *new* table with the specified columns.

In [21]:
# single column selected
ca_select = ca_table.select('Course Assistants')
ca_select.show()

# two columns selected, the first instance of ca_select is being overwritten
ca_select = ca_table.select('Label', 'Major')
ca_select.show()

# selecting columns from the previous table (you should not need to do this)
ca_select = ca_select.select('Label')
ca_select.show()

Course Assistants
Ethan Chang
?
?
?


Label,Major
Row 1,Data Science
Row 2,N/A
Row 3,N/A
Row 4,N/A


Label
Row 1
Row 2
Row 3
Row 4


#### t.drop(*column_or_columns*) (Lec #3)

As the name suggests, 'drop' will *remove* or drop the column(s). You can use either the column name or index to specify which columns you want to remove. Additionally, 'drop' will return a *new* table.

*Instead of selecting multiple columns just to exclude one, simply drop the column instead.*

In [22]:
ca_copy = ca_table.copy()
ca_copy.show()

ca_drop = ca_copy.drop('Label')
ca_drop.show()

# try to not do this, it gets difficult if your columns have long names
ca_copy.select('Course Assistants', 'Major').show()

Label,Course Assistants,Major
Row 1,Ethan Chang,Data Science
Row 2,?,N/A
Row 3,?,N/A
Row 4,?,N/A


Course Assistants,Major
Ethan Chang,Data Science
?,N/A
?,N/A
?,N/A


Course Assistants,Major
Ethan Chang,Data Science
?,N/A
?,N/A
?,N/A


#### t.sort(*column_or_label, descending=False, distinct=False*) (Lec #3)

'sort' will return a *new* table with the rows sorted according to the values in a column. Unless specified otherwise, 'sort' sorts in ascending order and keeps rows with repeated values. Setting 'descending' to True sorts in descending order, and setting 'distinct' to True keeps only one row for each distinct value in the sorted column.

In [23]:
ca_sort = ca_table.copy()

# sorting in descending order, see how the Row # starts with 4 then goes to 1
ca_sort.sort('Label', descending=True).show()

# sorting while removing non-unique values, all but one of the rows with '?' are gone
ca_sort.sort('Course Assistants', distinct=True).show()

Label,Course Assistants,Major
Row 4,?,N/A
Row 3,?,N/A
Row 2,?,N/A
Row 1,Ethan Chang,Data Science


Label,Course Assistants,Major
Row 2,?,N/A
Row 1,Ethan Chang,Data Science


#### t.where(*column_or_label, value_or_predicate=None, other=None*) (Lec #3)

'where' takes up to three arguments:
- the first is a column of the table, given as either a label/name (str) or an index (int)
- the second is either a predicate (a function that is applied to every value in the specified column, keeping the rows where it returns True) or a single value, in which case only rows whose value in the specified column equals it are kept
- the third is an optional second column; when it is given, each value in the first column is compared pairwise with the value in the same row of this column

A variety of predicates can be used; [this page](http://data8.org/datascience/predicates.html?highlight=#datascience.predicates.are) lists all of them with examples.
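
The three ways of calling 'where' described above can be mimicked with plain Python lists and dictionaries. The sketch below is only an illustration of the described behavior, not the datascience package's own code: the `rows` data and the `equal_to` and `where` functions are hypothetical stand-ins written for this guide.

```python
# Plain-Python illustration of the behavior described above.
# rows, equal_to, and where are hypothetical stand-ins, NOT the datascience API.
rows = [
    {'Label': 'Row 1', 'Course Assistants': 'Ethan Chang', 'Extra': '?'},
    {'Label': 'Row 2', 'Course Assistants': '?', 'Extra': '?'},
]

def equal_to(value):
    """Build a predicate: a function that returns True when x equals value."""
    return lambda x: x == value

def where(rows, column, value_or_predicate, other=None):
    """Keep the rows that pass the test on the given column."""
    if other is not None:
        # Three-argument form: compare each value with the same row's `other` value
        return [r for r in rows if value_or_predicate(r[other])(r[column])]
    if not callable(value_or_predicate):
        # A plain value is treated as "equal to this value"
        value_or_predicate = equal_to(value_or_predicate)
    return [r for r in rows if value_or_predicate(r[column])]

# where(column, value)
print([r['Label'] for r in where(rows, 'Course Assistants', 'Ethan Chang')])

# where(column, predicate, other_column)
print([r['Label'] for r in where(rows, 'Course Assistants', equal_to, 'Extra')])

# where(column, predicate)
print([r['Label'] for r in where(rows, 'Extra', equal_to('?'))])
```

The first call keeps only Row 1, the second keeps only Row 2 (the one row where the two columns match), and the third keeps both rows.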

In [24]:
ca_where = ca_table.with_columns(
    'Extra', ['?' for i in range(4)]
)
print('Base Table')
ca_where.show()


# shows rows in the table where the 'Course Assistants' column is equal to 'Ethan Chang'
# where(column, value) is what is being used
print('\n', 'Table #1')
ca_where.where('Course Assistants', 'Ethan Chang').show()

# shows rows in the table where values in 'Course Assistants' are equal to 'Extra'
# where(column, predicate, column) is being used, are.equal_to is a predicate
print('\n', 'Table #2')
ca_where.where('Course Assistants', are.equal_to, 'Extra').show()

# shows rows in the table where values in 'Extra' are equal to '?' (which is all values)
# where(column, predicate) is being used here
# this is similar to the first example with 'Ethan Chang', but this one uses are.equal_to()
print('\n', 'Table #3')
ca_where.where('Extra', are.equal_to('?')).show()

Base Table


Label,Course Assistants,Major,Extra
Row 1,Ethan Chang,Data Science,?
Row 2,?,N/A,?
Row 3,?,N/A,?
Row 4,?,N/A,?



 Table #1


Label,Course Assistants,Major,Extra
Row 1,Ethan Chang,Data Science,?



 Table #2


Label,Course Assistants,Major,Extra
Row 2,?,N/A,?
Row 3,?,N/A,?
Row 4,?,N/A,?



 Table #3


Label,Course Assistants,Major,Extra
Row 1,Ethan Chang,Data Science,?
Row 2,?,N/A,?
Row 3,?,N/A,?
Row 4,?,N/A,?


---
## **GitHub** <a id='github'></a>

GitHub is a website where people can store the code they wrote for specific projects. It also allows developers to collaborate and identify bugs/issues. This guide is on GitHub, and if you find a problem with the guide or have a suggestion, you can create a new issue on GitHub to let the contributors know.

If you are considering going into a field that involves software development in any form, it is essential to learn how to use Git and GitHub, or one of their alternatives.