# Introduction

Welcome to MSDS600! These "from the expert" (FTE) code files are examples that teach how to achieve the learning outcomes for the week. This file covers a basic Python coding review if you need it.

# Reading list

Reading list (available through O'Reilly through the [library](https://libguides.regis.edu/computer_informationsciences)):

[1] [Python Data Science Essentials - Third Edition by Alberto Boschetti and Luca Massaron.](https://learning.oreilly.com/library/view/python-data-science/9781789537864/) 
Sections:
- First Steps (“First Steps” through “Alternatives to Jupyter”)
- Strengthen Your Python Foundations (“Strengthen your Python Foundations” through “Don’t be shy, take a real challenge”)

[2] [Python for Data Science For Dummies, 2nd Edition by John Paul Mueller and Luca Massaron.](https://learning.oreilly.com/library/view/python-for-data/9781119547624/)
Chapters 1 and 2.

See the end of the presentation for more resources on learning/brushing up on Python.

# Python review

Much of this is covered in the Python Data Science Essentials book and many other places. We will cover:
- variable types (ints, floats, strings, booleans, bytes)
- data structures (lists, tuples, sets, dictionaries, NumPy arrays, Pandas DataFrames)
- operators
- functions
- objects, classes, methods, and attributes
- loops and comprehensions
- conditional statements
- packages and modules
- keywords and built-in functions

## Variables and data types

In Python, there are a few key data types:
- integers (`int`)
- floats (`float`)
- strings (`str`)
- booleans (`bool`)
- bytes (`bytes`)
There are other data types as well, such as `complex` for complex numbers.

When we create a variable, the type is determined automatically. For example, we can store an integer like so:

In [1]:
an_integer = 1
an_integer

1

We are naming the variable with best practices here: `snake_case` (lowercase with underscores separating words) and a descriptive name.
We can check the type of a variable with the built-in `type()` function:

In [2]:
type(an_integer)

int

We can convert variable types with "casting". This converts an integer (`int`) to a string (`str`):

In [3]:
str(an_integer)

'1'
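Casting also works in other directions. As a small sketch (the variable names below are illustrative and not from the original notebook), strings that look like numbers can be converted with `int()` and `float()`, while a string that is not a valid number raises a `ValueError`:

```python
# Illustrative names; not part of the original notebook
number_from_string = int('42')     # string -> int
float_from_string = float('3.5')   # string -> float
float_from_int = float(7)          # int -> float

print(number_from_string, float_from_string, float_from_int)

# A string that is not a valid integer cannot be cast
try:
    int('not a number')
except ValueError as error:
    conversion_error = str(error)
    print('Conversion failed:', conversion_error)
```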

Floats have decimal places:

In [4]:
a_float = 1.0
type(a_float)

float

The usual math operators work (addition, subtraction, etc). Mixing floats and ints usually results in a float. The `//` operator is integer (floor) division, which rounds down; the `%` operator is the modulo (remainder from division). Exponentiation is `**`. More complex math operators can be found in the `numpy` package or the built-in `math` module.

In [5]:
5 % 2

1

In [6]:
5 // 2

2

In [7]:
5 / 2

2.5

In [8]:
import numpy as np

np.log(10)

2.302585092994046

In [9]:
import math

math.sqrt(9)

3.0
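The operators described above that are not shown in the cells can be sketched as follows. The variable names are illustrative only; the negative-number cases are included because "rounds down" is easiest to see there:

```python
# Exponentiation with **
squared = 3 ** 2            # 9

# Mixing an int and a float gives a float
mixed_sum = 1 + 2.0         # 3.0

# True division (/) returns a float even when the result is whole
true_division = 4 / 2       # 2.0

# Floor division rounds toward negative infinity
negative_floor = -5 // 2    # -3

# The remainder takes the sign of the divisor
negative_mod = -5 % 2       # 1

print(squared, mixed_sum, true_division, negative_floor, negative_mod)
```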

Strings are sequences of characters:

In [10]:
a_string = 'test string here'
a_string

'test string here'

Booleans are `True` or `False`:

In [11]:
a_bool = True
type(a_bool)

bool

Bytes objects are sometimes encountered when loading data:

In [12]:
a_bytes = b'bytes object'
type(a_bytes)

bytes

Convert the boolean object to an integer using casting and see what happens:

Convert the number 5.8 to an integer and see what happens (converting a float to an integer drops the decimal part, so positive numbers are always rounded down):

## Data structures

### Lists
The main data structure in Python is the list:

In [13]:
a_list = [1, 2, 3, 4]
type(a_list)

list

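List indexing works like `[start:stop:step]`. Any of the three can be omitted: `start` defaults to the beginning of the list, `stop` to the end of the list, and `step` to 1, so `a_list[:]` steps through the list from beginning to end one element at a time. Note that `[0:-1:1]` is not the same as the default, because a `stop` of -1 excludes the last element.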

Python is "0-indexed", meaning the first element has index 0. R, by contrast, is 1-indexed. The `start` index is inclusive, the `stop` index is exclusive. Here is how to get the first element: 

In [14]:
a_list[0]

1

We can also 'slice' a list to get several elements (indices 0 and 1 here, since the stop index 2 is excluded):

In [15]:
a_list[:2]

[1, 2]

We can get elements from the end with negative index numbers:

In [16]:
a_list[-3:-1]

[2, 3]

The step argument controls the step size, i.e. how many elements to move each time. We can get every other element with a step of 2, and reverse a list with `[::-1]`:

In [17]:
# get every other element
a_list[::2]

[1, 3]

In [18]:
a_list[::-1]

[4, 3, 2, 1]
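To pull the slicing rules together, here is a short sketch on a hypothetical list called `letters` (not part of the original notebook). Omitting an index uses its default: the start of the list for `start`, the end of the list for `stop`, and 1 for `step`.

```python
letters = ['a', 'b', 'c', 'd', 'e']   # hypothetical example list

full_copy = letters[:]          # all defaults -> the whole list
without_last = letters[0:-1]    # a stop of -1 excludes the last element
middle = letters[1:4]           # indices 1, 2, 3 (stop index 4 is excluded)
last_two = letters[-2:]         # second-to-last element through the end
backwards_by_two = letters[::-2]  # walk backwards, two at a time

print(full_copy, without_last, middle, last_two, backwards_by_two)
```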

Lists have many built-in functions (a.k.a. methods): https://docs.python.org/3/tutorial/datastructures.html
`append` is a common one:

In [19]:
# adds 4 to the end of the list
a_list.append(4)
a_list

[1, 2, 3, 4, 4]

Lists can hold most anything, including other lists:

In [20]:
another_list = [[1, 2, 3], [5, 5, 5]]
another_list

[[1, 2, 3], [5, 5, 5]]

We can change elements of lists:

In [21]:
a_list[0] = 10

Lists can also be concatenated with the + operator:

In [22]:
a_list + another_list

[10, 2, 3, 4, 4, [1, 2, 3], [5, 5, 5]]

Use the `extend()` method of lists to add `another_list` on to the end of `a_list`. The resulting list has the same elements as with the + operator, but `extend()` modifies `a_list` in place instead of creating a new list:

### Tuples
A tuple is like a list, but is "immutable", meaning you cannot change it:

In [23]:
a_tuple = tuple([1, 2, 3])
a_tuple

(1, 2, 3)

Try changing the value of the first element of `a_tuple` and see what happens (you will get an error):

### Sets
Sets are the mathematical type of set - a collection of unique values. They have several built-in functions and operators available: https://docs.python.org/3/library/stdtypes.html#set

In [24]:
a_set = set(a_list)
a_set

{2, 3, 4, 10}

In [25]:
another_set = {1, 2, 3, 3}
another_set

{1, 2, 3}

The operator `in` checks if something is in something else, and works well with sets. It returns a boolean:

In [26]:
4 in a_set

True
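A few of the other set operators can be sketched like this. The sets `evens` and `small_numbers` are hypothetical examples, not from the original notebook:

```python
evens = {2, 4, 6, 8}            # hypothetical example sets
small_numbers = {1, 2, 3, 4}

in_both = evens & small_numbers        # intersection: values in both sets
only_evens = evens - small_numbers     # difference: in evens but not small_numbers
has_six = 6 in evens                   # membership test returns a boolean
unique_values = set([1, 1, 2, 2, 3])   # duplicates are dropped

print(in_both, only_evens, has_six, unique_values)
```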

Use the union operator (the `|` symbol, usually under your backspace key) to join `a_set` and `another_set`:

### Dictionaries
Dictionaries have keys and values, and look similar to sets since both are written with curly braces. There are many functions and related data types for dictionaries (e.g. `OrderedDict`): https://docs.python.org/3/library/stdtypes.html#dict

In [27]:
a_dict = {'key1': 'value_1', 'key_2': 12}
a_dict

{'key1': 'value_1', 'key_2': 12}

In [28]:
a_dict['another_key'] = 'hey'
a_dict

{'key1': 'value_1', 'key_2': 12, 'another_key': 'hey'}

In [29]:
'key1' in a_dict

True

In [30]:
a_dict.keys()

dict_keys(['key1', 'key_2', 'another_key'])

In [31]:
a_dict.values()

dict_values(['value_1', 12, 'hey'])

In [32]:
a_dict.items()

dict_items([('key1', 'value_1'), ('key_2', 12), ('another_key', 'hey')])
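The `items()` method is most often used in a `for` loop (loops are reviewed in the Loops section below). A minimal sketch with a hypothetical `inventory` dictionary, which also shows `get()` as a way to look up a key that may be missing:

```python
inventory = {'apples': 3, 'pears': 5}   # hypothetical example dictionary

# .items() yields (key, value) pairs, which unpack into two loop variables
for fruit, count in inventory.items():
    print(fruit, count)

# Indexing a missing key raises KeyError; .get() returns a default instead
missing_count = inventory.get('kiwis', 0)

# Assigning to an existing key overwrites its value
inventory['apples'] = 10

print(missing_count, inventory)
```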

Add another item to `a_dict` with the key `key3` and value `hooray!`:

### NumPy arrays
NumPy is a Python package for numerical analysis. It has math operators and other classes/objects. A common one is the `array` object, which is like a list but can be multi-dimensional:

In [33]:
import numpy as np

an_array = np.array([[1, 2, 3], [5, 5, 5]])
# indexing is [rows, columns]
an_array[:, 1]  # get all rows and the second column

array([2, 5])

We're first importing the library with an alias of `np`, which is the conventional way to import this. Then we create an array and index it to get the second column and all rows.
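Another difference from lists is that arithmetic on an array applies to every element. The sketch below uses a hypothetical array named `grid` and assumes NumPy is installed:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])   # hypothetical 2 x 3 array

doubled = grid * 2                # multiplies every element by 2
repeated_list = [1, 2, 3] * 2     # with a plain list, * repeats the list instead
first_row = grid[0, :]            # first row, all columns
column_sums = grid.sum(axis=0)    # sum down each column

print(grid.shape)   # (rows, columns)
print(doubled)
print(repeated_list)
print(first_row, column_sums)
```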

Get the last row and all columns of `an_array`:

### Pandas DataFrames and Series
A common data handling package in Python is pandas, which uses NumPy to hold data. We can easily make a DataFrame (similar to a spreadsheet) from a dictionary:

In [34]:
import pandas as pd

df = pd.DataFrame(data={'people': [5, 2, 3], 'revenue': [10, 1, 12]})
df

   people  revenue
0       5       10
1       2        1
2       3       12


If we get a single column from the DataFrame, it's a pandas Series:

In [35]:
df['people']

0    5
1    2
2    3
Name: people, dtype: int64

In [36]:
type(df['people'])

pandas.core.series.Series

In [37]:
type(df)

pandas.core.frame.DataFrame

Pandas is a crucial part of the data science Python technology stack, so you should spend time learning it. There are DataCamp, DataQuest, Kaggle, and other courses covering pandas. There is also a book used in MSDE620, *Pandas for Everyone*, which is quite good, as well as several other pandas books available on O'Reilly through the library.

Get the revenue column from the DataFrame:

## Functions

Functions are crucial in programming. We can write our own functions to avoid repeating ourselves, and others have written packages with functions in them to make it easier to do common things. We can use some built-in functions like so:

In [38]:
# get the length of something
len(a_list)

5

In [39]:
print(a_list)

[10, 2, 3, 4, 4]


Functions have a name, then parentheses, then take some arguments. The arguments can be positional, named, and so on. For example, the documentation for the `sorted()` function looks like this:

`sorted(iterable, /, *, key=None, reverse=False)`

The first argument, iterable, is a positional-only argument. The `/` designates that anything before it is positional-only (e.g. we cannot provide a name like `sorted(iterable=a_list)`, but should do `sorted(a_list)`). The `*` means anything after it is keyword-only and cannot be positional. Keyword arguments have the name, then the value: `sorted(a_list, reverse=True)`.

In [40]:
sorted(a_list, reverse=True)

[10, 4, 4, 3, 2]
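The `/` and `*` markers can also be used in our own functions (Python 3.8 or newer). The function `describe` below is a hypothetical example written only to illustrate the rules; it also shows the `key` argument of `sorted()`:

```python
def describe(value, /, *, prefix='Value'):
    """Hypothetical example: `value` is positional-only, `prefix` is keyword-only."""
    return f'{prefix}: {value}'

print(describe(5))
print(describe(5, prefix='Count'))

errors = []

# Passing the positional-only argument by name is rejected
try:
    describe(value=5)
except TypeError as error:
    errors.append(str(error))

# Passing the keyword-only argument by position is rejected too
try:
    describe(5, 'Count')
except TypeError as error:
    errors.append(str(error))

print(errors)

# `key` is keyword-only in sorted(); here we sort words by their length
words_by_length = sorted(['pear', 'fig', 'banana'], key=len)
print(words_by_length)
```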

We make our own functions with the `def` keyword, the function name, then the arguments in parentheses, and a colon at the end. Default values for arguments can be set with an equals sign, like `b=12`. The function body is indented (most people use the tab key, which is converted to 4 spaces). When the indentation ends, so does the function. We can use the `return` keyword to return values from the function. Our custom functions are used just like the built-in or pre-built functions. We can provide arguments by name or position here:

In [41]:
def test_function(a, b=12):
    return a + b

test_function(50)

62

In [42]:
test_function(a=50)

62

Use `test_function` to add 1 and 1 with named arguments:

## Objects and classes

Everything in Python is an object. These can have attributes. Attributes can be functions (methods) or values. For example, our pandas DataFrame and NumPy arrays have an attribute `shape` which is a tuple telling us the number of rows and columns:

In [43]:
df.shape

(3, 2)

DataFrames have lots of attributes and methods. One method is `sort_values`:

In [44]:
df.sort_values('people', ascending=False)

   people  revenue
0       5       10
2       3       12
1       2        1


There are ways to make our own classes, which is like making our own functions, but more advanced. It's beyond our scope here, but there are several books and online courses and tutorials that cover classes in Python, such as *Modern Python Cookbook - Second Edition* by Steven Lott available through the library, and the official documentation: https://docs.python.org/3/tutorial/classes.html

In Jupyter Notebooks and IPython, you can type a variable name, then the period, then press 'tab' to see what attributes are available. Try it with `df`:

## Scoping

One important topic is scoping - variables within functions or classes are not available outside of the functions or classes unless we declare them as globals. However, using global variables is bad practice and should be avoided. Here is an example of scoping: we cannot access the variable inside the function outside of it. We get a `NameError` because the variable does not exist outside of the function.

In [45]:
def scoping_example():
    new_var = 123
    return new_var

scoping_example()

new_var

NameError: name 'new_var' is not defined

Modify the function above so it prints out the `type` of `new_var` within the function. Also modify the code above so it doesn't result in an error.

## Loops

Along with lists, loops are another key part of Python. We can use `for` and `while` loops. A `for` loop goes through an iterable like a list; a `while` loop keeps going until its condition becomes false or we break out of it. To write a `for` loop, we use the word `for`, then a variable name, then `in`, then an iterable like a list, then a colon. The lines that follow are indented; when the indentation stops, so does the loop.

In [46]:
for i in [1, 2, 3]:
    print(i)

1
2
3


In [47]:
for i in [1, 2, 3]:
    print(i)
    break

1


Loops have some keywords: `break` and `continue`. `break` stops the loop and exits it, `continue` goes to the next iteration.
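Since the cell above only demonstrates `break`, here is a minimal sketch of `continue`. The list `printed` is an illustrative name used to record which values get past the `continue`:

```python
printed = []   # illustrative: collects the values that were printed

for i in [1, 2, 3, 4, 5]:
    if i % 2 == 0:
        continue   # skip even numbers and jump to the next iteration
    printed.append(i)
    print(i)
```

Only the odd numbers are printed, because `continue` skips the rest of the loop body for the even ones.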

`while` loops keep running until we use `break` or the condition becomes false. We use the word `while`, then give a condition that evaluates to a boolean, then a colon. Anything indented on the next lines is in the `while` loop.

In [48]:
i = 0
while i < 10:
    print(i)
    i += 1

0
1
2
3
4
5
6
7
8
9


The `+=` operator adds to a variable in place: `i += 1` is the same as `i = i + 1`.

We often use the `range` and `len` functions with loops. Here we loop through a list and get the index of each value in the list, then print each value:

In [49]:
for i in range(len(a_list)):
    print(a_list[i])

10
2
3
4
4
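When the index is not needed, it is common to loop over the values directly, and when both are needed the built-in `enumerate()` function is a frequent alternative to `range(len(...))`. A small sketch with a hypothetical list `colors`:

```python
colors = ['red', 'green', 'blue']   # hypothetical example list

# Loop over the values directly when the index is not needed
for color in colors:
    print(color)

# enumerate() yields (index, value) pairs when both are needed
indexed_colors = []
for index, color in enumerate(colors):
    indexed_colors.append((index, color))
    print(index, color)
```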


We can also use the `zip` function to join two iterables together:

In [50]:
for i, j in zip(range(10), range(10, 20)):
    print(i, j)

0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
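The topic list at the start of this review also mentions comprehensions, which are a compact way to build a list or dictionary from a loop. A minimal sketch, with illustrative names that are not from the original notebook:

```python
numbers = [1, 2, 3, 4, 5]   # hypothetical example list

# List comprehension: [expression for item in iterable]
squares = [n ** 2 for n in numbers]

# An optional `if` at the end filters the items
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

# Dictionary comprehension: {key: value for item in iterable}
square_lookup = {n: n ** 2 for n in numbers}

print(squares, even_squares, square_lookup)
```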


Try making a `for` loop to loop through a `range` from 5 to 10 and print out the numbers 5 to 10: 

## Conditionals

Along with loops, we can use conditions to branch our code. This is usually `if-elif-else` statements. We use comparisons or booleans to test and choose which branch to go down:

In [51]:
i = 10
if i == 10:
    print('i is 10')
elif i > 10:
    print('i is big')
else:
    print('i is small')

i is 10


We can use only the `if` by itself, or `if` with `else`, or include as many `elif`s as we want. Try making an if statement to check if the length of `a_list` is greater than 10, and print out something telling us about the length of the list:

## Packages and modules

A module is a Python file, like `module.py`. A package is a collection of modules, like pandas and numpy. We import them like so:

In [52]:
import math

We can also import a module from a package:

In [53]:
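# `linalg` is NumPy's linear algebra submodule
# (`numpy.warnings` is no longer importable in recent NumPy versions)
from numpy import linalg

In [54]:
linalg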



We can also import specific variables or functions from modules or packages:

In [55]:
from math import ceil

We can change the name of an import with aliases: 

In [56]:
from math import ceil as c

Don't do a wildcard import (`import *`) like this. It makes it hard to know where functions or variables came from, which makes the code harder to read and understand:

In [57]:
# don't do this!
from numpy import *

Import the function `allclose` from `numpy` and alias it as `ac`:

There is an easter egg in Python. Try running `import this`.

## Keywords and built-in functions

In Python there are several keywords and built-in functions. We shouldn't name variables, functions, or classes the same thing as these keywords. We already saw some of these, and you may notice they turn green in Jupyter Notebook or IPython. We are not using all of these properly here, but you can see they are turning green. The `None` object is a special one - if a function returns nothing and we store it in a variable, the variable will be `None`.

In [64]:
None

In [58]:
pass

In [59]:
continue

SyntaxError: 'continue' not properly in loop (<ipython-input-59-6ca52a340915>, line 1)

In [60]:
break

SyntaxError: 'break' outside loop (<ipython-input-60-6aaf1f276005>, line 1)

In [61]:
for

SyntaxError: invalid syntax (<ipython-input-61-4c4cefd7ff81>, line 1)

In [62]:
range

range

In [63]:
in

SyntaxError: invalid syntax (<ipython-input-63-9d26586a7869>, line 1)
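Two of the points above can be checked in code. This sketch uses the standard-library `keyword` module to test whether a name is a keyword, and a hypothetical function `print_greeting` to show that a function without a `return` statement gives back `None`:

```python
import keyword

def print_greeting():
    """Hypothetical function with no return statement."""
    print('hello')

result = print_greeting()
print(result is None)   # compare against None with `is`

for_is_keyword = keyword.iskeyword('for')      # `for` is a keyword
range_is_keyword = keyword.iskeyword('range')  # `range` is a built-in, not a keyword
print(for_is_keyword, range_is_keyword)

# keyword.kwlist holds every keyword for the running Python version
print(keyword.kwlist)
```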

Here is a keyword list, and list of built-in functions in Python:
- https://www.programiz.com/python-programming/keyword-list
- https://docs.python.org/3/library/functions.html

## Getting help

When we come across an error, there are a few ways to start off to get help:
- `?` or `help()` to check the documentation
- documentation through an internet search engine
- internet search engine with the error

For example, if we try to access a variable that doesn't exist, we get the error:

In [None]:
not_a_var

The top part of the error is the 'traceback'. It steps through the code that was run, including all the modules/Python files and functions involved. At the end of the error, there is an error type in CamelCase such as `NameError`, a colon, then the error message. Here, it is `NameError: name 'not_a_var' is not defined`. We can copy-paste this into a search engine, which will often take us to Stack Overflow, a very helpful site for figuring out what's going on. We also may find answers in the GitHub issues for packages.
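To see the two parts of that final line separately, one option is to catch the error with `try`/`except`. This is only a sketch for inspecting the error; the names `error_type` and `error_message` are illustrative:

```python
try:
    not_a_var   # this name was never defined, so Python raises NameError
except NameError as error:
    error_type = type(error).__name__   # the CamelCase error name
    error_message = str(error)          # the text after the colon
    print(f'{error_type}: {error_message}')
```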

## Coding Style

The book *Clean Code in Python 2nd Edition* by Mariano Anaya has several principles on best practices for coding. Remember that most likely you will be the next person to read your code, so you want to make it easy to understand later. This is especially important if you are working on a bigger project you will be working on for months or years. Some best practices are:
- naming variables, functions, and classes clearly so they are easy to understand
  - variables and functions should be `snake_case`, classes should be `CamelCase` 
- following PEP8 standards (https://www.python.org/dev/peps/pep-0008/, you can use the `autopep8` package to clean up your code if needed)
- using version control like Git with GitHub or GitLab
- breaking up redundant pieces of code into functions (DRY, do not repeat yourself)
- writing documentation for your functions
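
As a small sketch of several of these practices together, the hypothetical function below uses a descriptive `snake_case` name and a docstring. The docstring layout is one common convention, not a requirement:

```python
def average_revenue(revenues):
    """Return the mean of a list of revenue numbers.

    `revenues` is a non-empty list of ints or floats.
    """
    return sum(revenues) / len(revenues)


example_average = average_revenue([10, 1, 12])
print(example_average)

# The docstring is what help() and ? display for the function
print(average_revenue.__doc__)
```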

# End of review
That's the end of the review here. See the other FTE Jupyter file for a walkthrough of EDA and creating visualizations.