# Big Cheat Sheet

This file summarizes all the coding concepts learned from DataCamp in MA346, as well as those learned in CS230 that remain important in MA346.  It is broken into sections in the order in which we encounter the topics in the course, and the course schedule on [the main page](intro) links to each section from the day on which it's learned.

---

## Before Week 2: Review of CS230

---

### [Introduction to Python](https://www.datacamp.com/courses/intro-to-python-for-data-science) (optional, basic review)

#### Chapter 1: Python Basics

Comments, which are not executed:
```python
# Start with a hash, then explain your code.
```

Print simple data:
```python
print( 1 + 5 )
```

Storing data in a variable:
```python
num_friends = 1000
```

Integers and real numbers ("floating point"):
```python
0, 20, -3192, 16.51309, 0.003
```

Strings:
```python
"You can use double quotes."
'You can use single quotes.'
'Don\'t forget backslashes when needed.'
```

Booleans:
```python
True, False
```

Asking Python for the type of a piece of data:
```python
type( 5 ), type( "example" ), type( my_data )
```

Converting among data types:
```python
str( 5 ), int( "-120" ), float( "0.5629" )
```

Basic arithmetic ($+$, $-$, $\times$, $\div$):
```python
1 + 2, 1 - 2, 1 * 2, 1 / 2
```

Exponents, integer division, and remainders:
```python
1 ** 2, 1 // 2, 1 % 2
```

#### Chapter 2: Python Lists

Create a list with square brackets:
```python
small_primes = [ 2, 3, 5, 7, 11, 13, 17, 19, 23 ]
```

Lists can mix data of any type, even other lists:
```python
# Sublists are name, age, height (in m)
heroes = [ [ 'Harry Potter', 11, 1.3 ],
           [ 'Ron Weasley', 11, 1.5 ],
           [ 'Hermione Granger', 11, 1.4 ] ]
```

Accessing elements from the list is zero-based:
```python
small_primes[0]   # == 2
small_primes[-1]  # == 23
```

Slicing lists is left-inclusive, right-exclusive:
```python
small_primes[2:4] # == [5,7]
small_primes[:4]  # == [2,3,5,7]
small_primes[4:]  # == [11,13,17,19,23]
```

It can even use a "stride" to count by something other than one:
```python
small_primes[0:7:2]     # selects items 0,2,4,6
small_primes[::3]       # selects items 0,3,6
small_primes[::-1]      # selects all, but in reverse
```

If indexing gives you a list, you can index again:
```python
heroes[1][0]        # == 'Ron Weasley'
```

Modify an item in a list, or a slice all at once:
```python
some_list[5] = 10
some_list[5:10] = [ 'my', 'new', 'entries' ]
```

Adding or removing entries from a list:
```python
small_primes += [ 29, 31 ]
small_primes = small_primes + [ 37, 41 ]
small_primes.append( 43 )  # to add just one entry
del( heroes[0] )    # Voldemort's goal
del( heroes[:] )    # or, even better, this
```

Copying or not copying lists:
```python
# L will refer to the same list in memory as heroes:
L = heroes
# M will refer to a (shallow) copy of the heroes list:
M = heroes[:]
```
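
To see the difference in practice, here is a minimal sketch using a hypothetical list named `original` (not part of the course data). It also shows that a slice copy is shallow, so nested sublists are still shared:
```python
# hypothetical data, for illustration only
original = [ [ 'a', 1 ], [ 'b', 2 ] ]
alias = original        # same list in memory
shallow = original[:]   # new outer list (shallow copy)

alias.append( [ 'c', 3 ] )
len( original )         # == 3, changed through the alias
len( shallow )          # == 2, the copy is unaffected

shallow[0][1] = 99      # but the sublists are shared...
original[0][1]          # == 99
```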

#### Chapter 3: Functions and Packages

Calling a function and saving the result:
```python
lastSmallPrime = max( small_primes )
```

Getting help on a function:
```python
help( max )
```

Methods are functions that belong to an object.  (In Python, every piece of data is an object.)

Examples:
```python
name = 'jerry'
name.capitalize()             # == 'Jerry'
name.count( 'r' )             # == 2
flavors = [ 'vanilla', 'chocolate', 'strawberry' ]
flavors.index( 'chocolate' )  # == 1
```

Installing a package from conda:
```bash
conda install package_name
```

Ensuring conda forge packages are available:
```bash
conda config --add channels conda-forge
```

Installing a package from pip:
```bash
pip3 install package_name
```

Importing a package and using its contents:
```python
import math
print( math.pi )
# or if you'll use it a lot and want to be brief:
import math as M
print( M.pi )
```

Importing just some functions from a package:
```python
from math import pi, degrees
print( "The value of pi in degrees is:" )
print( degrees( pi ) )        # == 180.0
```

#### Chapter 4: NumPy

Creating NumPy arrays from Python lists:
```python
import numpy as np
a = np.array( [ 5, 10, 6, 3, 9 ] )
```

Elementwise computations are supported:
```python
a * 2       # == [ 10, 20, 12, 6, 18 ]
a < 10      # == [ True, False, True, True, True ]
```

Use comparisons to subset/select:
```python
a[a < 10]   # == [ 5, 6, 3, 9 ]
```

Note: NumPy arrays don't permit mixing data types:
```python
np.array( [ 1, "hi" ] )  # converts all to strings
```

NumPy arrays can be 2d, 3d, etc.:
```python
a = np.array( [ [ 1, 2, 3, 4 ],
                [ 5, 6, 7, 8 ] ] )
a.shape     # == (2,4)
```

You can index/select with comma notation:
```python
a[1,3]      # == 8
a[0:2,0:2]  # == [[1,2],[5,6]]
a[:,2]      # == [3,7]
a[0,:]      # == [1,2,3,4]
```

Fast NumPy versions of Python functions, and some new ones:
```python
np.sum( a )
np.sort( a )
np.mean( a )
np.median( a )
np.std( a )
# and others
```

---

### [Python Data Science Toolbox, Part 1](https://www.datacamp.com/courses/python-data-science-toolbox-part-1) (optional, basic review)

#### Chapter 1: Writing your own functions

Tuples are like lists, but use parentheses, and are immutable.

```python
t = ( 6, 1, 7 )        # create a tuple
t[0]                   # == 6
a, b, c = t            # a==6, b==1, c==7
```
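
A short sketch of what "immutable" means here: assigning to an entry of a tuple raises a `TypeError`. The name `t_as_list` is just an illustrative choice.
```python
t = ( 6, 1, 7 )
try:
    t[0] = 10           # tuples do not allow assignment
except TypeError as error:
    print( "Cannot modify a tuple:", error )

t_as_list = list( t )   # convert to a list to edit
t_as_list[0] = 10
t_as_list               # == [10, 1, 7]
```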

Syntax for defining a function:

(A function that modifies any global variables needs the Python `global` keyword inside to identify those variables.)
```python
def function_name ( arguments ):
    """Write a docstring describing the function."""
    # do some things here.
    # note the indentation!
    # and optionally:
    return some_value
    # to return multiple values: return v1, v2
```
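
As one possible illustration of the `global` keyword mentioned above, this sketch uses a hypothetical global variable `call_count` and a hypothetical function `count_call`:
```python
call_count = 0          # hypothetical global variable

def count_call ():
    """Adds 1 to the global call_count"""
    global call_count   # without this line, the next
    call_count += 1     # line raises UnboundLocalError

count_call()
count_call()
call_count              # == 2
```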

Syntax for calling a function:

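(Note the distinction between "parameters," the names listed in a function's definition, and "arguments," the values passed in a call; the placeholders in these templates use the two words interchangeably.)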
```python
# if you do not care about a return value:
function_name( parameters )
# if you wish to store the return value:
my_variable = function_name( parameters )
# if the function returns multiple values:
var1, var2 = function_name( parameters )
```
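
A concrete instance of the templates above might look like the following, where `min_and_max` is a hypothetical function written only for this example:
```python
def min_and_max ( values ):
    """Returns the smallest and largest entries"""
    return min( values ), max( values )

# store both return values at once:
low, high = min_and_max( [ 5, 10, 6, 3, 9 ] )
low, high               # == (3, 10)
```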

#### Chapter 2: Default arguments, variable-length arguments, and scope

Defining nested functions:
```python
def multiply_by ( x ):
    """Creates a function that multiplies by x"""
    def result ( y ):
        """Multiplies x by y"""
        return x * y
    return result
# example usage:
df["height_in_inches"].apply(
    multiply_by( 2.54 ) )  # result is now in cm
```

Providing default values for arguments:
```python
def rand_between ( a=0, b=1 ):
    """Gives a random float between a and b"""
    return np.random.rand() * ( b - a ) + a
```

Accepting any number of arguments:
```python
def commas_between ( *args ):
    """Returns the args as a string with commas"""
    result = ""
    for item in args:
        result += ", " + str(item)
    return result[2:]
commas_between(1,"hi",7)    # == "1, hi, 7"
```

Accepting a dictionary of arguments:
```python
def inverted ( **kwargs ):
    """Interchanges keys and values in a dict"""
    result = {}
    for key, value in kwargs.items():
        result[value] = key
    return result
inverted( jim=42, angie=9 )
        # == { 42 : 'jim', 9 : 'angie' }
```

#### Chapter 3: Lambda functions and error handling

Anonymous functions:
```python
lambda arg1, arg2: return_value_here
# example:
lambda k: k % 2 == 0    # detects whether k is even
```

Some examples in which anonymous functions are useful:
```python
from functools import reduce   # needed for reduce
list( map( lambda k: k%2==0, [1,2,3,4,5] ) )
                   # == [False,True,False,True,False]
list( filter( lambda k: k%2==0, [1,2,3,4,5] ) )
                   # == [2,4]
reduce( lambda x, y: x*y, [1,2,3,4,5] )
                   # == 120 (1*2*3*4*5)
```

Raising errors if users call your functions incorrectly:
```python
# You can detect problems in advance:
from functools import reduce
def factorial ( n ):
    if type( n ) != int:
        raise TypeError( "n must be an int" )
    if n < 0:
        raise ValueError( "n must be nonnegative" )
    # the initial value 1 handles n = 0 and n = 1:
    return reduce( lambda x,y: x*y, range( 2, n+1 ), 1 )

# Or you can let Python detect them:
def solve_equation ( a, b ):
    """Solves a*x+b=0 for x"""
    try:
        return -b / a
    except ZeroDivisionError:
        return None
solve_equation( 2, -1 )    # == 0.5
solve_equation( 0, 5 )     # == None
```
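
Code that calls such a function can catch the error it raises. This is a minimal sketch; `checked_sqrt` is a hypothetical function used only to show the pattern:
```python
def checked_sqrt ( x ):
    """Square root that rejects negative input"""
    if x < 0:
        raise ValueError( "x must be nonnegative" )
    return x ** 0.5

try:
    root = checked_sqrt( -4 )
except ValueError as error:
    root = None                # fall back to None
    message = str( error )     # keep the explanation
root, message   # == (None, 'x must be nonnegative')
```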

---

### [Intermediate Python](https://www.datacamp.com/courses/intermediate-python-for-data-science) (required review)

#### Chapter 1: Matplotlib

Conventional way to import matplotlib:
```python
import matplotlib.pyplot as plt
```

Creating a line plot:
```python
plt.plot( x_data, y_data )     # create plot
plt.show()                     # display plot
```

Creating a scatter plot:
```python
plt.scatter( x_data, y_data )  # create plot
plt.show()                     # display plot
# or this alternative form:
plt.plot( x_data, y_data, 'o' )  # 'o' = markers only
plt.show()
```

Labeling axes and adding title:
```python
plt.xlabel( 'x axis label here' )
plt.ylabel( 'y axis label here' )
plt.title( 'Title of Plot' )
```

#### Chapter 2: Dictionaries & Pandas

Creating a dictionary directly:
```python
days_in_month = {
    "january"  : 31,
    "february" : 28,
    "march"    : 31,
    "april"    : 30,
    # and so on, until...
    "december" : 31
}
```

Getting and using keys:
```python
days_in_month.keys()    # a view of the keys:
                        # "january", "february", ...
days_in_month["april"]  # == 30
```

Updating dictionary and checking membership:
```python
days_in_month["february"] = 29   # update for 2020
"tuesday" in days_in_month       # == False
days_in_month["tuesday"] = 9     # a mistake
"tuesday" in days_in_month       # == True
del( days_in_month["tuesday"] )  # delete mistake
"tuesday" in days_in_month       # == False
```

Build a pandas DataFrame manually from a dictionary:
```python
import pandas as pd
df = pd.DataFrame( {
    "column label 1": [
        "this example uses...",
        "string data here."
    ],
    "column label 2": [
        100.65,  # and numerical data
        -92.04   # here, for example
    ]
    # and more columns if needed
} )
df.index = [
    "put your...",
    "row labels here."
]
```

Import a DataFrame from a CSV file:
```python
# if row and column headers are in first row/column:
df = pd.read_csv( "/path/to/file.csv",
                  index_col = 0 )
# if no row headers:
df = pd.read_csv( "/path/to/file.csv" )
```

Indexing and selecting data:

```python
df["column name"]    # is a "Series" (labeled column)
df["column name"].values
                     # extract just its values
df[["column name"]]  # is a 1-column dataframe
df[["col1","col2"]]  # is a 2-column dataframe
df[n:m]              # slice of rows, a dataframe
df.loc["row name"]   # is a "Series" (labeled column)
                     # yes, the row becomes a column
df.loc[["row name"]] # 1-row dataframe
df.loc[["r1","r2","r3"]]
                     # 3-row dataframe
df.loc[["r1","r2","r3"],:]
                     # same as previous
df.loc[:,["c1","c2","c3"]]
                     # 3-column dataframe
df.loc[["r1","r2","r3"],["c1","c2"]]
                     # 3x2 slice of the dataframe
df.iloc[5]           # is a "Series" (labeled column)
                     # contains the 6th row's data
df.iloc[[5,6,7]]     # 3-row dataframe (6th-8th)
df.iloc[[5,6,7],:]   # same as previous
df.iloc[:,[0,4]]     # 2-column dataframe
df.iloc[[5,6,7],[0,4]]
                     # 3x2 slice of the dataframe
```
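
To check which form returns a Series and which returns a DataFrame, you can try the selections on a small table. The sketch below assumes a hypothetical 3x2 DataFrame with made-up labels:
```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame( { "c1": [ 1, 2, 3 ],
                     "c2": [ 4, 5, 6 ] },
                   index = [ "r1", "r2", "r3" ] )
type( df["c1"] )        # Series
type( df[["c1"]] )      # DataFrame (1 column)
type( df.loc["r2"] )    # Series (one row)
type( df.iloc[[1]] )    # DataFrame (1 row)
df.loc["r2","c2"]       # == 5 (row label, column label)
df.iloc[1,1]            # == 5 (same cell, by position)
```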

#### Chapter 3: Logic, Control Flow, and Filtering

Python relations work on NumPy arrays and Pandas Series:
```python
<, <=, >, >=, ==, !=
```

Logical operators can combine the above relations:
```python
and, or, not          # use these on booleans
np.logical_and(x,y)   # use these on numpy arrays
np.logical_or(x,y)    # (assuming you have imported
np.logical_not(x)     # numpy as np)
```

Filtering Pandas DataFrames:
```python
series = df["column"]
mask = series > some_number   # a boolean Series
df[mask]  # new dataframe, a subset of the rows
# or all at once:
df[df["column"] > some_number]
# combining multiple conditions:
df[np.logical_and( df["population"] > 5000,
                   df["area"] < 1250 )]
```
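
Here is a small worked example of the filtering pattern above, assuming a hypothetical DataFrame with made-up population and area figures:
```python
import numpy as np
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame( {
    "population": [ 8000, 3000, 12000 ],
    "area":       [ 1000,  500,  2000 ]
}, index = [ "A", "B", "C" ] )

big = df["population"] > 5000
big.tolist()                 # == [True, False, True]
df[big].index.tolist()       # == ['A', 'C']

both = np.logical_and( df["population"] > 5000,
                       df["area"] < 1250 )
df[both].index.tolist()      # == ['A']
```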

Conditional statements:
```python
# Take an action if a condition is true:
if put_condition_here:
    take_an_action()
# Take a different action if the condition is false:
if put_condition_here:
    take_an_action()
else:
    do_this_instead()
# Consider multiple conditions:
if put_condition_here:
    take_an_action()
elif other_condition_here:
    do_this_instead()
elif yet_another_condition:
    do_this_instead2()
else:
    finally_this()
```

#### Chapter 4: Loops

Looping constructs:
```python
while some_condition:
    do_this_repeatedly()
    # as many lines of code here as you like.
    # note that indentation is crucial!
    # be sure to work towards some_condition
    # becoming false eventually!

for item in my_list:
    do_something_with( item )

for index, item in enumerate( my_list ):
    print( "item " + str(index) +
           " is " + str(item) )

for key, value in my_dict.items():
    print( "key " + str(key) +
           " has value " + str(value) )

for item in my_numpy_array:
    # works if the array is one-dimensional
    print( item )

for item in np.nditer( my_numpy_array ):
    # if it is 2d, 3d, or more
    print( item )

for column_name in my_dataframe:
    work_with( my_dataframe[column_name] )

for row_name, row in my_dataframe.iterrows():
    print( "row " + str(row_name) +
           " has these entries: " + str(row) )

# in dataframes, sometimes you can skip the for loop:
my_dataframe["column"].apply( function )  # a Series
```

#### Chapter 5: Case Study: Hacker Statistics

Uniform random numbers from NumPy:
```python
np.random.seed( my_int )  # choose a random sequence
# (seeds are optional, but ensure reproducibility)
np.random.rand()          # uniform random in [0,1)
np.random.randint(a,b)    # random int n, a <= n < b
```
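
The following sketch illustrates why seeds give reproducibility: reusing the same seed replays the same sequence. The seed value 346 and the die-rolling setup are arbitrary choices for this example.
```python
import numpy as np

np.random.seed( 346 )            # arbitrary seed
first = np.random.rand( 3 )      # three random floats
np.random.seed( 346 )            # same seed again...
second = np.random.rand( 3 )     # ...same three floats
np.array_equal( first, second )  # == True

# simulate 1000 rolls of a six-sided die (values 1-6):
rolls = np.random.randint( 1, 7, size = 1000 )
rolls.min() >= 1 and rolls.max() <= 6    # == True
```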


---

### [pandas Foundations](https://www.datacamp.com/courses/pandas-foundations) (required review)

#### Chapter 1: Data ingestion & inspection

Basic DataFrame/Series tools:

```python
df.head(5)           # first five rows
df.tail(5)           # last five rows
series.head(5)       # head, tail also work on series
df.info()            # summary of the data types used
```

Adding details to reading DataFrames from CSV files:

```python
# if no column headers:
df = pd.read_csv( "/path/to/file.csv",
                  index_col = 0, header = None,
                  names = ['column','names','here'] )
# if any missing data you want to mark as NaN:
# (na_values can be a list of patterns,
# or a dict mapping column names to patterns/lists)
df = pd.read_csv( "/path/to/file.csv",
                  na_values = 'pattern to replace' )
# and many other options!  (see the documentation)
```

To get a DataFrame with a date/time index:
```python
# parse the index as dates (True tries only the
# index, so pair it with index_col):
df = pd.read_csv( "/path/to/file.csv",
                  index_col = 0,
                  parse_dates = True )
# read as dates just the columns you specify:
df = pd.read_csv( "/path/to/file.csv",
                  parse_dates = ['column','names'] )
# to use one of those columns as a date/time index:
df = pd.read_csv( "/path/to/file.csv",
                  parse_dates = True,
                  index_col = 'Date' )
# combine multiple columns to form a date
# (deprecated in recent pandas versions):
df = pd.read_csv( "/path/to/file.csv",
                  parse_dates = [[column,indices]] )
```

Export to CSV or XLSX file:
```python
df.to_csv( "/path/to/output_file.csv" )
df.to_excel( "/path/to/output_file.xlsx" )
```

You can also create a plot from a `Series` or dataframe:
```python
df.plot()            # or series.plot()
plt.show()
# or to show each column in a subplot:
df.plot( subplots = True )
plt.show()
# or to plot certain columns:
df.plot( x='col name', y='other col name' )
plt.show()
```

A few small ways to customize plots:
```python
plt.xscale( 'log' )
plt.yticks( [ 0, 5, 10, 20 ] )
plt.grid()
```

To create a histogram:
```python
plt.hist( data, bins=10 )      # 10 is the default
plt.show()
```

To "clean up" so you can start a new plot:
```python
plt.clf()
```

Write text onto a plot:
```python
plt.text( x, y, 'Text to write' )
```

To save a plot to a file:
```python
# before plt.show(), call:
plt.savefig( 'filename.png' )  # or .jpg or .pdf
```

---

### [Manipulating DataFrames with pandas](https://www.datacamp.com/courses/manipulating-dataframes-with-pandas) (required review)

#### Chapter 1: Extracting and transforming data

More ways to index and slice (this builds on the DataCamp Intermediate Python section):
```python
df.iloc[5:7,0:4]     # select ranges of rows/columns
df.iloc[:,0:4]       # select a range, all rows
df.iloc[[5,6],:]     # select listed rows, all columns
df.iloc[5:,:]        # all but the first five rows
df.loc['A':'B',:]    # colons can take row names too
                     # (but include both endpoints)
df.loc[:,'C':'D']    # ...also column names
df.loc['D':'A':-1]   # rows by name, reverse order
```
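
The endpoint rules are easy to mix up, so here is a minimal comparison on a hypothetical four-row DataFrame: label slices with `.loc` include both endpoints, while position slices with `.iloc` exclude the right one.
```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame( { "x": [ 10, 20, 30, 40 ] },
                   index = [ "A", "B", "C", "D" ] )
df.loc["A":"C"].index.tolist()   # == ['A', 'B', 'C']
df.iloc[0:2].index.tolist()      # == ['A', 'B']
```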

More ways to filter (this also builds on the DataCamp Intermediate Python section):
```python
# use & instead of np.logical_and:
df[(df["population"] > 5000)
 & (df["area"] < 1250 )]
# use | instead of np.logical_or:
df[(df["population"] > 5000)
 | (df["area"] < 1250 )]
# filtering on zero or missing values:
df.loc[:,df.all()]  # only columns with no zeroes
df.loc[:,df.any()]  # only columns with some nonzero
df.loc[:,df.isnull().any()]
                    # only columns with a NaN entry
df.loc[:,df.notnull().all()]
                    # only columns with no NaNs
df.dropna( how='any' )
                    # remove rows with any NaNs
df.dropna( how='all' )
                    # remove rows with all NaNs
```
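
To make the missing-value filters concrete, this sketch applies them to a hypothetical DataFrame in which column `b` has no NaNs and only the last row is complete:
```python
import numpy as np
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame( {
    "a": [ 1.0,    np.nan, 3.0 ],
    "b": [ 4.0,    5.0,    6.0 ],
    "c": [ np.nan, np.nan, 9.0 ]
} )
has_nan = df.loc[:,df.isnull().any()]
has_nan.columns.tolist()     # == ['a', 'c']
no_nan = df.loc[:,df.notnull().all()]
no_nan.columns.tolist()      # == ['b']
df.dropna( how='any' ).index.tolist()   # == [2]
df.dropna( how='all' ).index.tolist()   # == [0, 1, 2]
```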

You can filter one column based on another using these tools.
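
For instance, a sketch like the following selects the population entries for rows whose area is small; the DataFrame and its numbers are hypothetical:
```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame( {
    "population": [ 8000, 3000, 12000 ],
    "area":       [ 1000,  500,  2000 ]
} )
# population values, only where area is under 1250:
small = df["population"][df["area"] < 1250]
small.tolist()               # == [8000, 3000]
# the same selection written with .loc:
same = df.loc[df["area"] < 1250, "population"]
same.tolist()                # == [8000, 3000]
```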

Apply a function to the data, returning a new DataFrame (`df.apply` passes each column to the function as a Series):
```python
def example ( x ):
    return x + 1
df.apply( example )   # adds 1 to everything
df.apply( lambda x: x + 1 )    # same
# some functions are built-in:
df.floordiv( 10 )
# many operators work on whole columns at once:
df['total pay'] = df['salary'] + df['bonus']
# to extend a dataframe with a new column:
df['new col'] = df['old col'].apply( f )
# slightly different syntax for the index:
df.index = df.index.map( f )
```

You can also map columns through `dict`s, not just functions.
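
A minimal sketch of mapping through a `dict`, assuming a hypothetical column of clothing sizes and a made-up lookup table `size_to_number`:
```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame( { "size": [ "S", "M", "L", "M" ] } )
size_to_number = { "S": 1, "M": 2, "L": 3 }
# each entry is replaced by its value in the dict:
df["size number"] = df["size"].map( size_to_number )
df["size number"].tolist()   # == [1, 2, 3, 2]
```

Entries that do not appear as keys in the `dict` become NaN in the result.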
