# <span style = "color:rebeccapurple">Python and Machine Learning Review</span>

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

## <span style = "color:darkorchid">What do I assume you know?</span>

This workshop is an introduction to `scikit-learn`, not to programming. Hence, we will assume you have some familiarity with the following:

#### Machine learning preliminaries
This `scikit-learn` workshop will help you implement machine learning in your projects. However, we cannot be comprehensive in such a short course; in particular, we don't have much time to introduce foundational concepts in machine learning. Hence, I will assume you have <b>some familiarity with machine learning</b>: not necessarily technical, but enough to know what it is used for and, broadly speaking, how it works. If you don't, that is OK. We will review some of those concepts as we proceed, but we may go a bit fast, so make sure you ask plenty of questions. You are also encouraged to review concepts on your own during and after the workshop.

#### Python preliminaries
<ul>
    <li>Basic Python structures (lists, dictionaries, tuples) and control flow (loops, conditionals, etc.)</li>
    <li>Numpy arrays</li>
    <li>Pandas dataframes</li>
    <li>Basics of matplotlib for visualization</li>
    <li>Basics of classes and objects (we will do a quick review)</li>
</ul>

We will have a quick review of python structures, arrays, and dataframes, but we cannot linger much on them. We will use `matplotlib` and `seaborn` for plotting, but you do not really need to know, at this stage, all the specifics for how those work.

# <span style = "color:rebeccapurple">What is Machine Learning?</span>

## <span style = "color:darkorange">Conceptual Intermezzo - Machine Learning tasks, concepts, models</span>

See slides

# <span style = "color:rebeccapurple">Python Basic Objects</span>

## <span style = "color:darkorchid">Imports</span>

It is best practice to place imports at the top of your python script (`.py`) or your jupyter notebook (`.ipynb`). You may not be familiar with what we are importing next, but it will be clear by the end of the notebook.

In [6]:
import numpy as np      # <-- Numpy is usually imported as 'np' (this is convention)
import pandas as pd     # <-- Pandas is usually imported as 'pd' (this is convention)

## <span style = "color:darkorchid">Data Structures in Python</span>

We will use lists, dictionaries, tuples, numpy arrays and pandas dataframes. While you don't need to be an expert, some intuition on how these work and what their differences are is crucial. Let's review each of these.

### <span style="color:teal">Base python structures</span>

Base python has three basic structures you should strive to be acquainted with:
<ul>
    <li>lists</li>
    <li>dictionaries</li>
    <li>tuples</li>
</ul>

#### Lists

Lists are ordered collections of objects. You can create empty lists, increase the size of lists, and obtain specific elements through indexing.

In [2]:
# Empty lists
my_list = []
print(my_list)

[]


In [3]:
# Non-empty lists
my_list = [1, 2, "hello"]
print(my_list)

[1, 2, 'hello']


In [4]:
# Adding an element to a list
my_list.append("good bye")
print(my_list)

[1, 2, 'hello', 'good bye']


In [5]:
# Iterating over a list
for i in my_list:
    print(i)

1
2
hello
good bye


In [6]:
# Indexing a list
print(my_list[0])
print(my_list[1:3])

1
[2, 'hello']
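
Two details of the cell above are easy to miss: indices start at 0, and a slice `start:stop` includes position `start` but excludes position `stop`. Here is a minimal sketch of both, plus negative indices, using a made-up list called `letters` (not part of the workshop data):

```python
# A made-up list used only for illustration
letters = ["a", "b", "c", "d"]

first = letters[0]        # indices start at 0
last = letters[-1]        # negative indices count from the end
middle = letters[1:3]     # positions 1 and 2; position 3 is excluded

print(first, last, middle)
print(len(letters))       # number of elements in the list
```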


In [7]:
# Overwriting an element in a list:
my_list[3] = "hello again"
print(my_list)

[1, 2, 'hello', 'hello again']


**Note**

Lists are agnostic to data type: not all elements must be of the same type, and they don't have to be numbers. Hence, lists do not represent "vectors" or "matrices" (that's the role of numpy arrays), and they are not tables either (that's the role of pandas dataframes).

#### Dictionaries

Dictionaries are also collections of objects, but instead of being indexed by order, as in the list case, they have a *key*, which uniquely identifies the *value* of your object. These are called key-value pairs.

In [8]:
# An empty dictionary
my_dict = {}
print(my_dict)

{}


In [9]:
# A non-empty dictionary with key-value pairs formatted as key:value
my_dict = {"data": [1,2,3],
          "salutation": "hello",
          "inception": {"some key":"some value"}}
print(my_dict)

{'data': [1, 2, 3], 'salutation': 'hello', 'inception': {'some key': 'some value'}}


In [10]:
my_dict["data"]

[1, 2, 3]

In [11]:
my_dict["salutation"]

'hello'

In [12]:
my_dict["inception"] # <-- This is a dictionary inside a dictionary. Inception!

{'some key': 'some value'}

**Note**

We have used only strings as keys. Most of the time this will be the case. It is possible to use other objects as keys, but we won't go into that (if you know about mutability and hashability, only immutable and hashable objects can be keys).
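
If you are curious, here is a small sketch of what that restriction looks like in practice. The `grid` dictionary is a made-up example; it uses tuples (covered in the next section) as keys, and then tries a list as a key, which fails:

```python
# Hypothetical example: tuples are immutable and hashable, so they work as keys
grid = {(0, 0): "origin", (1, 2): "some point"}
print(grid[(1, 2)])

# Lists are mutable and therefore not hashable, so this raises a TypeError
try:
    bad = {[0, 0]: "origin"}
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```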

#### Tuples
Tuples are ordered collections of objects, almost like lists, but they are immutable: you can't change them as you did with lists. You cannot change their elements, add a new element, delete one, etc.

In [13]:
# :: TUPLES ::
my_tuple = (1,2,"hello")
print(my_tuple)

(1, 2, 'hello')


In [14]:
my_tuple[0]

1

In [15]:
my_tuple[0] = 10

TypeError: 'tuple' object does not support item assignment

You should get an error in the last cell. That's because you can't change the elements of a tuple.
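
Immutability does not stop you from building a *new* tuple out of an old one, or from reading its elements into separate variables (often called unpacking). A minimal sketch, with the variable names `point`, `longer`, `a`, `b` and `c` made up for illustration:

```python
point = (1, 2, "hello")

# Tuples cannot be modified, but you can build a new tuple from an old one
longer = point + ("good bye",)

# Unpacking assigns each element to its own variable
a, b, c = point

print(longer)
print(a, b, c)
```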

### <span style="color:teal">Numpy arrays</span>

**What is Numpy?**

Numpy arrays will take the role of vectors and matrices in python. They are ordered collections of objects, like lists, but there is an important constraint: all elements must be of the same type. Furthermore, if your elements are numeric, you can do numeric operations on the arrays, including matrix multiplication.
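
To see the "same type" constraint in action, here is a small sketch (the variable names are made up). When you mix types, numpy does not complain; it quietly converts every element to a common type, which can be surprising:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])     # the integers are converted to floats
mixed_types = np.array([1, 2, "hello"])   # everything is converted to strings

print(ints.dtype)
print(mixed_numbers.dtype)
print(mixed_types.dtype)
print(mixed_types)
```

The `.dtype` attribute tells you which type the array settled on, so it is a good first thing to check when numeric operations on an array fail.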

**Why Numpy?**

Numpy is great for matrix-like numerical operations, which are the computational basis of most machine learning algorithms. Dealing with numpy arrays is much faster and more memory efficient than dealing with pandas dataframes (more on those below). Many machine learning libraries, including `scikit-learn`, are happier when you input data as a numpy array and not as a pandas dataframe.

**Importing Numpy**

Numpy is a python package that is not imported by default, so we must import it. Chances are you already have numpy installed, so there is no need to install it. We have imported it at the top of this ipynb file.

In [9]:
# create an array
my_array = np.array([1,2,3])
print(type(my_array))
my_array

<class 'numpy.ndarray'>


array([1, 2, 3])

Note the object type is `numpy.ndarray` (an n-dimensional array). If you print it, though, it looks much like a list, only without the commas:

In [17]:
print(my_array)

[1 2 3]


In [18]:
# Index an array
print(my_array[0])

1


In [19]:
# Perform numeric computations with an array:
print(2 * my_array)

[2 4 6]


In [20]:
# Compare that with a list:
print(2 * [1,2,3])

[1, 2, 3, 1, 2, 3]


In [21]:
# Let's build a matrix:
my_matrix = np.array([[0,1,0],
                     [1,0,0],
                     [0,0,1]])
my_matrix

array([[0, 1, 0],
       [1, 0, 0],
       [0, 0, 1]])

In [22]:
# We can multiply arrays as if they were matrices/vectors with the np.matmul() function
np.matmul(my_matrix, my_array)

array([2, 1, 3])
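
As a side note, the `@` operator is shorthand for matrix multiplication, while `*` multiplies entry by entry, so the two give different results. A minimal sketch that redefines the same `my_matrix` and `my_array` so it runs on its own:

```python
import numpy as np

my_matrix = np.array([[0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 1]])
my_array = np.array([1, 2, 3])

product = my_matrix @ my_array      # same as np.matmul(my_matrix, my_array)
elementwise = my_matrix * my_array  # multiplies each row by my_array, entry by entry

print(product)
print(elementwise)
```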

### <span style="color:teal">Pandas dataframes</span>

Pandas is another package that is commonly used in python but must be imported. Pandas is mostly used to manipulate datasets, for example by subsetting. The main object in pandas is the dataframe, which is basically a table. Pandas allows you to manipulate these tables easily. When it comes to datasets, the convention is for rows to be the different observations (for example patients) and for columns to be the observed features (for example age, sex, etc.). 

In [23]:
# Creating a dataframe from a list of lists
pd.DataFrame(data = [[1,2],[3,4], [5,6]])

   0  1
0  1  2
1  3  4
2  5  6


Note it automatically created column headings (0,1 in this case) and row indices.

In [24]:
# Creating dataframe with column names:
pd.DataFrame(data = [[1,2],[3,4], [5,6]], columns = ["Column 1", "Column 2"])

   Column 1  Column 2
0         1         2
1         3         4
2         5         6


In [25]:
# Creating dataframe from a dictionary:
pd.DataFrame(data = {"Column 1": [1, 3, 4], "Column 2": [2,4,6]})

   Column 1  Column 2
0         1         2
1         3         4
2         4         6


Note that with the dictionary each key-value pair is a column. In the case of nested lists, each sublist is a row.
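
To convince yourself of the row/column distinction, here is a small sketch that builds the same table both ways and compares the results. The names `from_rows` and `from_columns` are made up for this example:

```python
import pandas as pd

# Each inner list is a row
from_rows = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6]],
                         columns=["Column 1", "Column 2"])

# Each key-value pair is a column
from_columns = pd.DataFrame(data={"Column 1": [1, 3, 5],
                                  "Column 2": [2, 4, 6]})

print(from_rows.shape)                 # (number of rows, number of columns)
print(from_rows.equals(from_columns))  # True if both tables hold the same data
```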

In [11]:
# You can also use numpy arrays:
my_matrix = np.array([[0,1,0],
                     [1,0,0],
                     [0,0,1]])

df = pd.DataFrame(data = my_matrix, columns = ["col1", "col2", "col3"])
df

   col1  col2  col3
0     0     1     0
1     1     0     0
2     0     0     1


In [27]:
# To get the columns of a dataframe, you can call the column names as you'd do with dictionaries:
df["col1"]

0    0
1    1
2    0
Name: col1, dtype: int64

This actually returns a pandas *series*, which is basically a single column (the numbers on the left are the indices, not actual values). If you want to return a *dataframe*, which is often necessary, you can use double brackets:

In [28]:
df[["col1"]]

   col1
0     0
1     1
2     0
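
The difference between single and double brackets is easiest to see by checking types and shapes. The sketch below rebuilds the same small dataframe so it can run on its own; `series` and `frame` are names made up for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[0, 1, 0],
                                 [1, 0, 0],
                                 [0, 0, 1]]),
                  columns=["col1", "col2", "col3"])

series = df["col1"]       # single brackets: a Series (one-dimensional)
frame = df[["col1"]]      # double brackets: a DataFrame (two-dimensional)

print(type(series), series.shape)
print(type(frame), frame.shape)

# Both can be converted to numpy arrays with .to_numpy()
print(series.to_numpy().shape)
print(frame.to_numpy().shape)
```

The one-dimensional versus two-dimensional distinction carries over to the numpy arrays, which matters whenever a function expects a table (rows and columns) rather than a flat sequence of values.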


In [12]:
# You can create new columns also as with dictionaries:
df["col4"] = [1,1,1]
df

   col1  col2  col3  col4
0     0     1     0     1
1     1     0     0     1
2     0     0     1     1


In [30]:
# If you only want to see the first few rows of your dataframe (five by default), use the .head() method:
df.head()

   col1  col2  col3  col4
0     0     1     0     1
1     1     0     0     1
2     0     0     1     1


In [13]:
# You can drop a column if you don't want it. We'll do this a couple of times in the workshop:
df.drop(columns=['col4'])

   col1  col2  col3
0     0     1     0
1     1     0     0
2     0     0     1


In [16]:
# Dropping a column does not change the original dataframe, so you must reassign or assign to new variable
df.drop(columns=['col4'])
print(df)

df = df.drop(columns=['col4'])
print(df)

   col1  col2  col3  col4
0     0     1     0     1
1     1     0     0     1
2     0     0     1     1
   col1  col2  col3
0     0     1     0
1     1     0     0
2     0     0     1


In the `.head()` example above there is no difference because we only have three rows, but you will see it used later with larger data.
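
To see `.head()` actually shorten the output, here is a minimal sketch with a larger, made-up dataframe (`big_df` and its `value` column are invented for this example):

```python
import numpy as np
import pandas as pd

# Made-up data: 100 rows, one column
big_df = pd.DataFrame(data={"value": np.arange(100)})

print(big_df.shape)
print(big_df.head())     # first 5 rows by default
print(big_df.head(3))    # first 3 rows
```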

## <span style = "color:darkorchid">Classes and Objects</span>

Python is very versatile. You can write function after function if you wish. However, much of its strength stems from **object-oriented programming**. The `scikit-learn` library makes plenty of use of objects and classes, so let's quickly review what these are.

Imagine you are at a high-end restaurant. Let's say you are seeing Gordon Ramsay at work. There is an executive chef, a head chef, several sous-chefs, specialized chefs (for example for roasting, for pastries, etc.), each with a team of specialists (the butcher, the grill chef, the baker, the confectioner, etc.). Here is an image I got from google images:

![brigade](images/brigade-de-cuisine-high-speed-learning.png){width=50%}

Think of each of these as a type or a *class* of chefs. There are important things to note:
<ul>
    <li>They are all of the generic type <i>chef</i>, but some have extra skills or responsibilities.</li>
    <li>You can have several chefs of the same class (like several sous-chefs). However, these are not the same people!</li>
</ul>

Well, classes in python are something similar. A class is a specific type of entity that has specific skills and attributes. Objects are the realizations of these classes, and each realization is called an *instance*. In python, the skills are called *methods*: these are actions that all instances of a class can perform. Instances also have *attributes*, which are variables, possibly unique, that each instance has (like the names of individual chefs).

We won't go further over classes but the important principle is this:

When you are writing a large python program, do not think of yourself as a home cook who does everything from scratch, following each step one after another. Instead, **think of yourself as an executive chef**. First you appoint all the chefs working for you, each with predetermined skills and attributes. You think of the overall plan, and then you delegate tasks to the respective chefs (who may, in turn, delegate tasks to their own chefs and specialists).

Indeed, `scikit-learn` is very similar. It has classes of classifiers, of pre-processors, of regression models, etc. Your job will not be to cook absolutely everything by yourself, but to organize things conceptually, hire your appropriate chefs, tell them what to do and trust them.

### <span style="color:teal">Syntax for class instances</span>

We already encountered several classes: lists, dictionaries, tuples, numpy arrays and pandas dataframes. Let's review class syntax with pandas dataframes:

**The Pandas DataFrame object**

In [18]:
# To create the instance of a class:
df_instance = pd.DataFrame(data=[1,2,3,4,5,6,7,8,9,10])

In [32]:
# Check what the class of our new object is
type(df_instance)

pandas.core.frame.DataFrame

Attributes

In [33]:
# Your object can return its own data/attributes to you. For that use the dot syntax:
df_instance.shape

(10, 1)

Methods

In [34]:
# Your object can also perform actions (methods). Use dot plus parentheses:
df_instance.head()

   0
0  1
1  2
2  3
3  4
4  5


In [35]:
# A method with an argument:
df_instance.apply(sum)

0    55
dtype: int64

List all methods and attributes:

In [19]:
dir(df_instance)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__arrow_c_stream__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__dataframe__',
 '__dataframe_consortium_standard__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 ...

(But it's better to just look at the documentation of whatever class you are using)

**An imaginary ChefClass object**

If we were to have a `ChefClass` as in our example above, we would create instances, obtain attributes, and call methods in the following manner.

```python
# If you already have a class called ChefClass, you can create an instance like this:
chef = ChefClass()

# You can obtain attributes using a period:
chef.specialty

# You call a method (ask it to perform a skill) using a period and parentheses, as if it were a function call:
chef.cook()
```

These are the three main ingredients of python objects:
1. Create an instance: `x = ClassName()`
2. Obtain internal data: `x.some_attribute`
3. Perform an action: `x.some_method()`
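
If you want to run the three steps yourself, here is a minimal sketch of what such a class could look like. Everything in it is hypothetical and made up to match the restaurant analogy: the `name` and `specialty` attributes, the `cook` method, and the fact that this version of `ChefClass` takes two arguments when you create an instance.

```python
# A hypothetical ChefClass, made up to match the restaurant analogy
class ChefClass:
    def __init__(self, name, specialty):
        # Attributes: data stored in each instance
        self.name = name
        self.specialty = specialty

    def cook(self):
        # Method: an action every instance can perform
        return f"{self.name} is cooking {self.specialty}"


# Two instances of the same class, each with its own attributes
chef = ChefClass("Ana", "pastries")
other_chef = ChefClass("Luis", "roasts")

print(chef.specialty)
print(chef.cook())
print(other_chef.cook())
```

Both chefs share the same skill (`cook`), but each carries its own data, just like two sous-chefs in the same kitchen.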

**Scikit-learn objects**

I emphasize the three points above because we will be using them a lot with `sklearn`. You will see a recipe similar to the one above: we create instances of `sklearn` objects, we make them perform an action, and then we obtain data from them.

This will become clearer with examples, but keep it in mind!
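
As a small preview, here is a sketch of that recipe with one `sklearn` class. `LinearRegression` is chosen only as an illustration, and the data is made up (it follows y = 2x + 1 exactly); the rest of the workshop may use different models and datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2x + 1
X = np.array([[0], [1], [2], [3]])
y = np.array([1, 3, 5, 7])

model = LinearRegression()   # 1. create an instance
model.fit(X, y)              # 2. perform an action (a method)
print(model.coef_)           # 3. obtain data (attributes)
print(model.intercept_)
```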

When you are learning a new library in python, remember to keep an *object-oriented* mindset:
- What type of objects am I dealing with?
- What operations can I do with these objects?
- What information, if any, do these objects hold?
- What methods, if any, do these objects perform?

For example, `Numpy`'s basic object is the *array*, while `Pandas`' basic object is the *dataframe*. An array is like a vector or matrix, hence it makes sense to add them, multiply by a scalar, etc. A dataframe is more of an information table, hence it makes sense to search for data, expect non-numerical data, etc. A list, a much more basic object, contains any type of objects, hence I should not think of it as a vector or as a dataframe.

And on to `scikit-learn`!