* * *
<pre> NYU Paris            <i> Artificial intelligence - Fall 2023 </i></pre>
* * *


<h1 align="center"> Lab 1: Introduction to Python </h1>

<pre align="left"> September 8th 2023               <i> Author: Hicham Janati </i></pre>
* * *


##### Goals:
- Have a working Python / Jupyter environment with packages installed
- Know where to look for information: external resources & documentation
- Discover core packages of the Python ecosystem:
    1. `Numpy` for scientific computing
    2. `Matplotlib` for plotting and visualization
    3. `Pandas` for manipulating heterogeneous dataframes
    4. `scikit-learn` for Machine learning models

# 1 - Getting started with Jupyter
1. Anaconda is a large Python distribution that bundles several Python packages and tools and simplifies their management. If you do not have it installed on your laptop, go over to <a href=https://docs.anaconda.com/anaconda/install/>the official documentation</a> and follow the OS-dependent installation guidelines.
2. Create a labs folder for the Python labs of this course. For me, this path is `/Users/hichamjanati/Documents/work/teaching/NYU/labs`. Download the notebook `introduction_python.ipynb` from the [class website](https://aiteachings.github.io/NYU-AI-Fall23/lectures/) and save it in that folder.
3. Now we need to launch Jupyter from the labs folder. To do so, open the Anaconda Terminal (Anaconda Prompt). A command-line window should pop up. If you are not familiar with the command line, all you need to know for our purposes is how to:
* Find out which folder your terminal is in: `pwd` (print working directory) on Linux/Mac | `cd` on Windows.
* List the content of the current folder: `ls` on Linux/Mac | `dir` on Windows.
* Change the current folder: `cd` followed by the subfolder you want to go to. Below, we complete the path with `work/teaching/NYU/code`:

![terminal-screenshot](img/terminal.png)

4. Now launch Jupyter by running `jupyter notebook`, which should open your browser. From there, you should be able to open and run this notebook.
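
Once a notebook is running, the same three terminal operations are also available from Python through the standard `os` module. The following is a minimal sketch; the printed folder and its content depend on your machine.

```python
import os

# Equivalent of `pwd` (Linux/Mac) or `cd` without argument (Windows)
current_folder = os.getcwd()
print("Current folder:", current_folder)

# Equivalent of `ls` (Linux/Mac) or `dir` (Windows)
content = sorted(os.listdir(current_folder))
print("Content:", content)

# Equivalent of `cd <subfolder>`; "." means "stay in the current folder"
os.chdir(".")
```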

# 2 - Getting started: Python in Jupyter

## 2.1 running cells and magic commands
The following Python cell is a naive loop that stores the first 10 million integers in a list and sums them. Click on it, then press `Shift+Enter` to run it.

In [2]:
%%time

N = 10_000_000 # underscores in whole numbers are ignored by Python

numbers = []  # create an empty list
for ii in range(N):
    numbers.append(ii) # add the number to the list
total = sum(numbers)

print(f"The total is {total}")

The total is 49999995000000
CPU times: user 309 ms, sys: 147 ms, total: 456 ms
Wall time: 497 ms


The `%%time` at the beginning of the cell is called a _magic command_; it reports the time the CPU took to run the entire cell. Magic commands with a single percent sign apply to one line only:

In [6]:
%time print("The total is", sum([ii for ii in range(N)]))

The total is 49999995000000
CPU times: user 175 ms, sys: 95 ms, total: 270 ms
Wall time: 299 ms
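
Magic commands only exist inside Jupyter / IPython. In a plain Python script, a comparable measurement can be sketched with the standard `time` module. Note that `time.perf_counter` measures elapsed (wall) time only, and that `N` is reduced here, as an assumption, to keep the run short.

```python
import time

N = 1_000_000  # smaller than in the notebook so the script runs quickly

# Classic loop with append
start = time.perf_counter()
numbers = []
for ii in range(N):
    numbers.append(ii)
total_loop = sum(numbers)
time_loop = time.perf_counter() - start

# List comprehension
start = time.perf_counter()
total_comprehension = sum([ii for ii in range(N)])
time_comprehension = time.perf_counter() - start

print(f"Loop:          total={total_loop}, {time_loop * 1000:.1f} ms")
print(f"Comprehension: total={total_comprehension}, {time_comprehension * 1000:.1f} ms")
```

The exact timings vary from one run and one machine to another, so only their order of magnitude is meaningful.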



### Question 1: 
Lists created with an inline loop like this one are called _list comprehensions_.
Can you guess why this list comprehension is roughly 2x faster than the classic loop above?

In [23]:
"A list comprehension builds the list directly, without looking up and calling numbers.append at every iteration, so it is faster than the classic for loop"
"When you create a list, you can mix whatever types you want in Python, so Python has to check the type of every element when computing the sum, which takes computation time"
"The size of the list is not fixed, so it grows as elements are added; the loop inside a list comprehension runs faster than an explicit for loop"

In [3]:
print(100)

100


In [5]:
"Hello"

'Hello'

## 2.2 Jupyter cells: code, markdown and shortcuts

One of the main advantages of Jupyter is the ability to alternate between text cells (such as the one you are reading right now) and Python cells. Double-click on this text to edit it. Press `Shift + Enter` to leave Edit Mode.

Text cells are actually *Markdown* cells. Markdown is a _super_ light formatting language. As you probably noticed by editing some of these cells, **double asterisks make text bold**, while _underscores make text italic_. Markdown is a long story for another day; check [the documentation](https://www.markdownguide.org/getting-started/) for more. More importantly, here is how to create a Markdown cell:

1. Enable the _command mode_ by hitting `Esc`. Command mode in Jupyter makes the cursor disappear.
2. Press `M` to switch from code to Markdown
3. Or press `Y` to switch from Markdown to code.

Several other shortcuts exist in _command mode_. The ones I usually use are:
1. `A` to add a new cell above
2. `B` to add a new cell below
3. `Enter` to go into Edit Mode and edit a cell
4. `H` to open the help and check all other shortcuts

Keep in mind that these shortcuts only work in Command Mode (i.e. after pressing `Esc`).

## 2.3 The Numpy library

#### Speed and vectorization

Let's write code performing the same summing operation as above, but using the `Numpy` library.

In [9]:
import numpy as np

In [10]:
%%time
N = 10_000_000
numbers = np.arange(N)  # creates an array of integers from 0 to N-1
total = numbers.sum()   # sums the array
print(f"The total is {total}")

The total is 49999995000000
CPU times: user 58.8 ms, sys: 40.4 ms, total: 99.3 ms
Wall time: 115 ms


As you can see, `Numpy` is about 3x faster than the list comprehension, and the code is much shorter. `Numpy` should always be preferred to native Python lists when dealing with nothing but numbers and matrices. `Numpy` vectorizes operations: instead of going through the elements one by one in a Python for loop, an operation is applied to the whole array at once.
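
To make the idea of vectorization concrete, here is a minimal sketch on a tiny array. The variable names are illustrative and not part of the lab.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Native Python: the operation is written element by element
result_list = [2 * v + 1 for v in values]

# Numpy: the same expression is applied to the whole array at once
array = np.array(values)
result_array = 2 * array + 1

print(result_list)   # [3.0, 5.0, 7.0, 9.0]
print(result_array)  # same values, stored in a Numpy array
```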

The dot product between two arrays $x$, $y$ of length $n$ is given by: $\langle x, y \rangle = x^{\top} y = \sum_{i=1}^n x_i y_i$.
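
As a quick sanity check of the formula, here is a sketch on two made-up arrays of length 3, where the sum can be verified by hand: $1 \times 4 + 2 \times 5 + 3 \times 6 = 32$.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Sum of element-wise products: 1*4 + 2*5 + 3*6
by_hand = sum(x_i * y_i for x_i, y_i in zip(x, y))

print(by_hand)    # 32
print(x.dot(y))   # 32
```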

### Question 2:
Complete the following cells to compare the speed of dot products using native loops vs the numpy operation `result = x.dot(y)`.

In [11]:
N = 10_000_000
x = np.random.randn(N)  # creates an array of random numbers following the Gaussian bell curve distribution
y = np.random.randn(N)

# dot product with a native Python loop: sum of x_i * y_i
%time result_loops = sum(x[ii] * y[ii] for ii in range(N))
print(result_loops)


In [7]:
# dot product with the vectorized numpy operation
%time result_numpy = x.dot(y)
print(result_numpy)

#### Numpy slicing

Numpy offers a simple way to select subsets of an array, called slicing. For instance, to get the slice from the 3rd to the 5th element:

In [14]:
x = np.arange(10)
print("All the array: ", x)
print("A slice: ", x[2:5])

All the array:  [0 1 2 3 4 5 6 7 8 9]
A slice:  [2 3 4]


Remember that Python starts indexing at 0 and that a slice `[start:end]` includes `start` but excludes `end`. If omitted, `start` defaults to 0 and `end` defaults to the length of the array, so the slice runs through the last element. Picking the first 6 elements:

In [13]:
x[:6]

array([0, 1, 2, 3, 4, 5])

To leave out elements at the end of the array, one can use negative indices, which count from the end. For example:

In [None]:
x[:-1]

In [None]:
x[:-3]
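
For reference, a short sketch of what negative indices select on the array of integers from 0 to 9:

```python
import numpy as np

x = np.arange(10)

print(x[:-1])   # everything except the last element: 0 to 8
print(x[:-3])   # everything except the last three elements: 0 to 6
print(x[-3:])   # only the last three elements: 7, 8, 9
print(x[-1])    # the last element itself: 9
```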

Slices can also take a third parameter, the `step`. So far we omitted this argument, which is equal to 1 by default. Starting at 0, we can pick the elements at even indices by using a step of 2:

In [None]:
x[0:10:2]

We can omit the start and end arguments here, since they are equal to their default values:

In [None]:
x[::2]
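
A few more step values, sketched on the same array of integers from 0 to 9. A negative step walks through the array backwards.

```python
import numpy as np

x = np.arange(10)

print(x[::2])    # [0 2 4 6 8]
print(x[1::2])   # [1 3 5 7 9]
print(x[::3])    # [0 3 6 9]
print(x[::-1])   # [9 8 7 6 5 4 3 2 1 0]
```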

### Question 3:
Create a slice that picks odd numbers in reverse order using one slice only.

In [26]:
x = np.arange(10)
print(x[::-2])

[9 7 5 3 1]


## 2.4 Matplotlib: plotting and visualization

###### Example 1: curve plot
We want to plot the function $f: x \mapsto x^2$ over the range $[-5, 5]$.
To do so we create a regular grid of 1000 numbers in $[-5, 5]$ using the Numpy function `linspace`.

In [None]:
import matplotlib.pyplot as plt

# we create our square function

def f(x):
    return x ** 2 # basic operations are applied element-wise to numpy arrays

n = 1000
x = np.linspace(-5, 5, n)
fx = f(x) 

plt.figure(figsize=(3, 3))
plt.plot(x, fx)
plt.show()

We can improve the quality of this visualization by adding:
- a grid
- a title 
- labels for axes

In [None]:
plt.figure(figsize=(3, 3))
plt.plot(x, fx)
plt.grid(True)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Square function")
plt.show()

### Question 4
Assume you forgot what arguments the function `np.linspace` takes. You can either:
- Look up the Numpy documentation directly (searching for "numpy linspace" on Google should immediately lead you to numpy.org)
- Run the following cell to open the documentation window in Jupyter:

In [None]:
np.linspace?
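
The `?` syntax only works in Jupyter / IPython; in a plain Python session, the built-in `help(np.linspace)` prints the same documentation. As a reminder of the main arguments, here is a small sketch with a deliberately tiny grid:

```python
import numpy as np

# np.linspace(start, stop, num): num evenly spaced points, both ends included by default
grid = np.linspace(-5, 5, 5)
print(grid)  # values: -5, -2.5, 0, 2.5, 5

# endpoint=False leaves out the stop value
half_open = np.linspace(0, 1, 4, endpoint=False)
print(half_open)  # values: 0, 0.25, 0.5, 0.75
```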

###### Example 2: scatter plot

Assume we have data on the grades obtained by 16 students and the number of hours each spent on assignments per week.

In [None]:
data = np.array([[80, 3],
                 [71, 4],
                 [95, 6],
                 [62, 1],
                 [68, 0],
                 [100, 5],
                 [95, 7],
                 [95, 5],
                 [83, 3],
                 [71, 2],
                 [80, 3],
                 [82, 2],
                 [80, 5],
                 [80, 4],
                 [86, 6],
                 [84, 8]])

print(data.shape)

We visualize these data points as 2D coordinates, each student represented by a dot. To do so we call the `scatter` function of matplotlib.

In [None]:
plt.figure(figsize=(4, 4))
plt.scatter(data[:, 0], data[:, 1])
plt.grid(True)
plt.xlabel("Grade")
plt.ylabel("Hours / week")
plt.show()
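
The expressions `data[:, 0]` and `data[:, 1]` slice a 2D array along both axes: the first slice selects rows, the second selects columns. A minimal sketch on the first three rows of the data (the names `small`, `grades` and `hours` are illustrative):

```python
import numpy as np

# A small (3, 2) array: one row per student, columns = (grade, hours)
small = np.array([[80, 3],
                  [71, 4],
                  [95, 6]])

grades = small[:, 0]         # all rows, first column
hours = small[:, 1]          # all rows, second column
first_student = small[0, :]  # first row, all columns

print(grades)         # [80 71 95]
print(hours)          # [3 4 6]
print(first_student)  # grade 80 and 3 hours
```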

###### Example 3: random numbers and histograms

The following cell generates two arrays: 
- 1000 random uniformly distributed numbers in $[0, 1)$
- 1000 random numbers following the Gaussian (normal) distribution.

We display their histograms with a legend.

In [None]:
N = 1000
uniform = np.random.rand(N)
gaussian = np.random.randn(N)

plt.figure()
plt.hist(uniform, bins=50, color="cornflowerblue", label="Uniform", density=True)
plt.hist(gaussian, bins=50, color="gold", label="Gaussian", alpha=0.7, density=True) # alpha parameter controls transparency
plt.legend()
plt.grid()
plt.show()

Run the cell multiple times. What do you notice?

When writing code, we always want it to be deterministic, i.e. reproducible. To generate the same random data each time we run the code, we need to fix the _seed_ (or state) of the random generator:

In [None]:
seed = 42
rng = np.random.RandomState(seed)
rng.randn()

### Question 5
Add the seeded generator `rng` before the plot and draw the two arrays from it, i.e. with `rng.rand(N)` and `rng.randn(N)`. What do you notice now?
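
The effect of a fixed seed can also be checked on a small example. In this sketch, two generators created with the same seed produce exactly the same numbers, while unseeded draws differ from one call to the next (the variable names are illustrative):

```python
import numpy as np

seed = 42

# Two generators created with the same seed produce the same sequence
rng_a = np.random.RandomState(seed)
rng_b = np.random.RandomState(seed)
sample_a = rng_a.randn(5)
sample_b = rng_b.randn(5)
print(np.array_equal(sample_a, sample_b))  # True

# Without a fixed seed, two successive draws are (almost surely) different
unseeded_a = np.random.randn(5)
unseeded_b = np.random.randn(5)
print(np.array_equal(unseeded_a, unseeded_b))
```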

## 2.5 - Pandas dataframes: Heterogeneous data
While Numpy is excellent and can handle matrices (arrays) of any shape, all the elements of an array must have the same type, which seldom happens in practice. Real data variables often include:
- ordered integers (for count data such as number of houses, number of genes, or survey satisfaction scores)
- real numbers (prices, distances, anything measurable with a certain degree of accuracy)
- strings (text, names, ...)
- unordered integers (e.g. continent: 0 for Europe, 1 for America, 2 for Africa, 3 for Asia)

This is where `Pandas` comes into play. Let's create a `Pandas` dataframe from the grade data.

In [None]:
import pandas as pd

df = pd.DataFrame(data, columns=["grade", "hoursworked"])
df.head()

We can create a simple categorical variable for the quality of the grade, add it as a new column, and change its type to categorical:

In [None]:
df["quality"] = ["bad" if x < 70 else "good" for x in df.grade]
df["quality"] = df["quality"].astype("category")
df.head()

The types of the dataframe variables:

In [None]:
df.dtypes
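
To see how a dataframe holds heterogeneous columns, here is a sketch on a small made-up table. The column names and values are invented for the example and are not part of the lab data.

```python
import pandas as pd

# Hypothetical toy table mixing different variable types
toy = pd.DataFrame({
    "name": ["Ada", "Bob", "Cleo"],             # strings
    "n_courses": [4, 5, 3],                     # ordered integers (counts)
    "height_m": [1.70, 1.82, 1.65],             # real numbers
    "continent": ["Europe", "Asia", "Europe"],  # unordered categories
})
toy["continent"] = toy["continent"].astype("category")

# Each column keeps its own type
print(toy.dtypes)
```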

To have a quick overview of the numerical variables of the entire table, run:

In [None]:
df.describe()

To select portions of the dataframe, you can use `df.loc` for slicing by labels or `df.iloc` for slicing by integer positions.

In [None]:
df.head()

In [None]:
df.loc[[1, 3], ["grade", "quality"]]

In [None]:
df.iloc[[1, 3], :2]
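
With the default index, row labels and row positions coincide, which hides the difference between `loc` and `iloc`. The following sketch uses a hypothetical table whose row labels are 10, 20 and 30 to make the distinction visible:

```python
import pandas as pd

# Hypothetical table whose row labels are not 0, 1, 2
toy = pd.DataFrame({"grade": [80, 71, 95], "hoursworked": [3, 4, 6]},
                   index=[10, 20, 30])

by_label = toy.loc[20, "grade"]   # row labeled 20, column "grade"
by_position = toy.iloc[1, 0]      # second row, first column
first_cell = toy.iloc[0, 0]       # first row, first column

print(by_label)      # 71
print(by_position)   # 71
print(first_cell)    # 80
```

Here `toy.loc[0]` would raise a `KeyError`, since no row carries the label 0.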

Pandas has its own plotting functions using matplotlib under the hood. For example, you can quickly view the grade distribution with:

In [None]:
df.grade.hist()

Pandas has much more to offer. If you want to learn more, please check the [overview tutorial](https://pandas.pydata.org/docs/user_guide/10min.html).