In this notebook, we are going to review some of the basics in Python that you partially covered in P0. This is slightly different to P0, however, as you are interacting with a Notebook rather than a Python prompt. The advantage of a Notebook is that it is a lot more user-friendly and easier to read and organise. For teaching purposes, it's much better!

We will get started on some very easy exercises and progressively see more complicated topics. 

The goal of this short tutorial is not so much to make you a programmer, but to give you all the tools that could be needed in future lectures. For example, we do not cover functions or classes, which would be a necessity for a programmer; but we will spend some time on operations relating to dataframes.

Please fill in the document as you go!

# 1. Preambles in a Jupyter Notebook

Out of the box, Python is fairly "basic": it contains, for example, commands for basic arithmetic and basic data structures (such as lists). Any package we wish to use in a notebook on top of these basic functionalities must first be imported, often under a shorter alias to make it quicker to type.

We will use a couple of packages throughout the class:


* Numpy for arrays
* Pandas for dataframes
* Matplotlib for graphing
* Scikit-learn for machine learning

These are the main ones. We may sometimes use other packages for class-specific reasons. I will let you know when this occurs.

In [33]:
import numpy as np
import pandas as pd

In general, in preambles of the notebooks I give you, there is also a part that reads in the data that we are going to use throughout the lecture. We will see how we "read in" data later on.

# 2. Print, variable assignment and types, basic operations, lists

All of the concepts we see here should be easy and immediate for you in the long-run (i.e., you should be so familiar with them that you don't need any googling or help using them). 

## 1. Printing

The "print" command enables us to "see" what is going on in the background. We use it in this way: `print("Hello World!")`. Try it out!

In [None]:
print("Hello World!")

Hello World!


Note that you can get around using `print` by just typing the string itself, e.g., `'Hello World!'`. Try it out!

In [None]:
'Hello World!'

'Hello World!'

This only works for the last expression in a cell, though. If you type `"Hello World!"` and then, in the same cell, `"My name is..."`, only the last one appears. Try it out! This is where `print` is particularly useful.

In [None]:
"Hello World!"
"My name is..."

'My name is...'

## 2. Variable Assignment and Types

It is very useful to know how to store information into a variable. For example, if the price of a good is 28.75 euros and I don't want to keep remembering this number, I can simply write `p=28.75` and then call `p` throughout. This is known as *variable assignment*: we assign to the variable p the value 28.75. 

Try assigning the integer 104 to the variable `number`. Then print out `number`.

In [None]:
number=104
print(number)

104


To organize things, Python classifies information into categories and then handles these categories differently. There are many categories but we just focus on four: integers (such as 104), floats (you can view these as decimal numbers), strings (which are text based), and Booleans (True/False). For Python to recognize these categories, you don't have to do much except input your information correctly. You can use `type` to see what this gives you. For example:

In [None]:
number=104
type(number)

int
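Of the four categories listed above, Booleans are the least familiar, so here is a minimal sketch of how they arise. The variable names (`is_open`, `price`, `is_expensive`) are made up for illustration.

```python
# Booleans are written True and False (capitalised, no quotes)
is_open = True
print(type(is_open))      # <class 'bool'>

# A comparison produces a Boolean
price = 28.75
is_expensive = price > 100
print(is_expensive)       # False

# Quotes matter: "104" is text, 104 is an integer
print("104" == 104)       # False

# int() converts text that looks like a whole number into an integer
print(int("104") + 1)     # 105
```

The way you type a value determines its category: `104` is an integer, `104.0` is a float, and `"104"` is a string.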

Now pick your favorite float and string, assign them to some variables and print out their type!

In [None]:
myfloat=3.5
print(type(myfloat))
mystring="Georgina"
print(type(mystring))

<class 'float'>
<class 'str'>


## 3. Basic operations

These are the basic arithmetic operations that you may have to conduct: they include `+` for addition, `-` for subtraction, `*` for multiplication, `/` for division and `**` for powers. 

Our first exercise is to take a temperature in Celsius and convert it to Fahrenheit using $F = C \cdot 9/5 + 32$. For example, $0^\circ C$ gives $0 \cdot 9/5+32= 32^\circ F$. How much is $15^\circ$ Celsius in Fahrenheit?

In [None]:
15*9/5+32

59.0

Compute the area of a circle of radius 3. (Reminder: $A=\pi R^2$ and use `np.pi` for $\pi$.)

In [None]:
R=3
np.pi*R**2

28.274333882308138

The `%` operation gives the remainder of a division: try `10%2` versus `11%2`.

In [None]:
10%2

0

In [None]:
11%2

1

## 4. Lists

These are the most basic data structures one can encounter in Python. Defining a list L can look something like this `L=[1,2,3]`. A list can take as an entry any type of variable (strings, floats, integers). It can also be taken to be empty if the goal is just to add to it later `L=[]`.

Create a list L that contains: your first name, your last name, your age. Print it out.

In [2]:
L=["Georgina","Hall",31]
print(L)

['Georgina', 'Hall', 31]


Check that your list has size 3 by running `len()`.

In [3]:
len(L)

3

Access the second element of your list using `L[1]`. Why is it not `L[2]`? How would you access the last two elements?

In [4]:
print(L[1])
print(L[1:3])

Hall
['Hall', 31]


Note that `L[-1]` gives access to the last element.

In [5]:
L[-1]

31

Change the first element of the list to your second name, or a name you would have liked to have if you don't have a second name. Use the assignment operator (`=`) to do this.

In [6]:
L[0]="Louise"
print(L)

['Louise', 'Hall', 31]


Using `.append` add your month of birth (as a string) to your list L.

In [7]:
L.append("October")
print(L)

['Louise', 'Hall', 31, 'October']


Suppose I want to add two pieces of information to the list: the class name and number, i.e., I want to add `M=["ML&O", 1]` to the existing list L. Try using `.append()` and `.extend()` to see how they are different. Which one should you use?

In [8]:
L=["Georgina","Hall",30]
M=["ML&O",1]
L.append(M)
L

['Georgina', 'Hall', 30, ['ML&O', 1]]

In [9]:
L=["Georgina","Hall",30]
M=["ML&O",1]
L.extend(M)
L

['Georgina', 'Hall', 30, 'ML&O', 1]

Note that `.append` and `.extend` modify the *original* list. If you want to create a new variable that contains L with M added on you would use the `+` operator thus:

In [10]:
L=["Georgina","Hall",30]
M=["ML&O",1]
NewList=L+M
print(NewList)

['Georgina', 'Hall', 30, 'ML&O', 1]


One last operator that might be useful for lists is the ability to count frequencies `.count`. Use `.count` to count the number of occurrences of 5 in the list below.

In [11]:
L=[1,2,4,2,4,5,1,5,0]

In [12]:
L.count(5)

2

Other useful list methods are `.sort`, which sorts the list in place, `.insert`, which inserts an element at a specified index, and `.index`, which returns the index of the first occurrence of a given item.
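As a short illustration of these three methods, here is a sketch that reuses the list from the `.count` exercise. The variable name `first_five` is only there for illustration.

```python
L = [1, 2, 4, 2, 4, 5, 1, 5, 0]

# .index returns the position of the first occurrence of an item
first_five = L.index(5)
print(first_five)         # 5

# .insert(position, item) adds an item at the given position
L.insert(0, 10)
print(L)                  # [10, 1, 2, 4, 2, 4, 5, 1, 5, 0]

# .sort sorts the list itself (it does not return a new list)
L.sort()
print(L)                  # [0, 1, 1, 2, 2, 4, 4, 5, 5, 10]
```

Like `.append` and `.extend`, both `.insert` and `.sort` modify the original list.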

# 3. Conditions, if...then...else, and loops

It can be very useful to check whether a condition holds or to loop through a list. We give some examples now.

## 1. Conditions

The first condition is simply whether something is equal to something else. We use here `==`: when the left hand side of the double equality is equal to the right hand side, True is returned. Otherwise False is returned. Note that `==` and `=` mean fundamentally different things: `==` is about checking whether two things are the same, `=` assigns a value to whatever is on the left hand side. Note that not equal to is `!=`.

Create a variable `x` equal to 2. Check using `==` whether `x` is equal to 2.

In [13]:
x=2
x==2

True

There are also the `and` and `or` operators. The `and` operator returns True if both conditions it links are true. The `or` operator returns True if at least one of the conditions is true.

Try it out for yourself: define a variable `x` equal to `10`. Try both the `and` and `or` operators with conditions x is equal to 10 and x is greater than or equal to 15.

In [14]:
x=10
x==10 and x>=15

False

In [15]:
x=10
x==10 or x>=15

True

The final condition we see here is the condition `in` which checks whether a given element is in a given structure.
Define the list `L=[3,4,5]` and the variable `x=3` then check whether x is in L. What happens if `x=1`?

In [16]:
L=[3,4,5]
x=3
print(x in L)
x=1
print(x in L)

True
False


## 2. If...then...else

We have seen how to check different conditions on variables. It may be useful sometimes to take some action based on the answer to the condition. This is codified in if...then...else statements: if [condition] is satisfied, then [do this], else [do that]. For example:

In [17]:
name = "John"
age = 23
if name == "John" or age == 23:
    print("Your name is John or you are 23 years old.")

else:
    print("Your name mustn't be John")

Your name is John or you are 23 years old.


Two things are **crucial** here: the colon (`:`) at the end of the `if` and `else` lines, and the indentation. If you don't indent the `print` lines, this code will not work. Try it out!

Sometimes you may have many conditions to check, in which case you would use elif, in this way:

In [18]:
name = "John"

if name == "Rachel":
    print("Your name is Rachel.")

elif name=="John":
    print("Your name is John.")

else:
    print("Your name mustn't be John, nor Rachel.")

Your name is John.


Your turn: create a list with a number of elements between 3 and 5. Then have an if...then...else type statement, which returns 3 if the list of elements has length 3, 4 if it has length 4, 5 if it has length 5, "there is an error" otherwise.

In [19]:
L=[1,2,3,4,5,6]

if len(L)==3:
    print(3)

elif len(L)==4:
    print(4)
    
elif len(L)==5:
    print(5)
    
else:
    print("There is an error.")

There is an error.


## 3. For loops

For loops enable us to repeat an operation many times without typing it out at each iteration.

For example, we can use it to sum over all possible items in a list:

In [20]:
L=[2,5,10,4]
sum_L=0

for i in L:
    sum_L=sum_L+i
    
print(sum_L)

21


The notion of `range` is very useful for for loops. Try the following code to understand what range does.

In [21]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


In [22]:
for i in range(5,10):
    print(i)

5
6
7
8
9


In [23]:
for i in range(0,10,2):
    print(i)

0
2
4
6
8


How could we use `range`, `len(L)` to compute the sum of all elements in the list L above?

In [24]:
L=[2,5,10,4]
sum_L=0

for i in range(len(L)):
    sum_L=sum_L+L[i]
    
print(sum_L)


21


One can also have a `while` clause: for example:

In [25]:
count = 0
while count < 5:
    print(count)
    count = count+1 

0
1
2
3
4


This means that while the condition `count<5` is satisfied, we keep executing the command. As soon as it stops being satisfied, we stop.

These for loops can be very useful to create lists. For example, if we want to create a list with all powers of 2 from 1 to 16, we can do so like this:

In [26]:
List_powersof2=[]
for i in range(0,5):
    List_powersof2.append(2**i)

print(List_powersof2)

[1, 2, 4, 8, 16]


Your turn! Create a list with all the powers of 10 up to 100000 (included), **removing 10.** (You may want to use an if statement for this.)

In [71]:
List_powersof10=[]
for i in range(0,6):
  if 10**i !=10:
    List_powersof10.append(10**i)

print(List_powersof10)

[1, 100, 1000, 10000, 100000]


# 4. Data structures

Data structures play a key role for us: we start with numpy arrays, then pandas dataframes (which we will use heavily in the Machine Learning part of the course), and then dictionaries (which we use in Optimization).

## 1. Numpy arrays

Numpy arrays are an alternative to Python lists. As indicated by the name, they are part of the Numpy package. The advantages are that they are fast, easy to work with, and give users the opportunity to perform calculations across entire arrays. Note that elements of the array are accessed in exactly the same way as for lists.

Generally, to create a numpy array, we first write a list and then specify that it is a numpy array. This of course requires the numpy package to be imported. For example:

In [34]:
# Create 2 lists height and weight
height = [1.67,  1.87, 1.82, 1.60, 1.73, 1.85]
weight = [55, 100, 83, 91, 61, 70]

# Create 2 numpy arrays from height and weight
np_height = np.array(height)
np_weight = np.array(weight)

Check what the type of `np_height` is.

In [35]:
type(np_height)

numpy.ndarray

Numpy arrays are useful for many things. First off, it is very easy to do computations with them. For example, if you want to create an array `bmi` that contains the BMI (given by $weight/height^2$) then this can easily be done by acting as if the arrays are just numbers. Try it out!

In [36]:
bmi=np_weight/np_height**2
print(bmi)

[19.72103697 28.59675713 25.05736022 35.546875   20.38156971 20.45288532]


Another very useful feature is the ability to filter the array as needed. For example, healthy BMIs are between 18 and 25. We can select the values in the `bmi` array that match by combining simple conditions. Note that for arrays, conditions are combined element by element: `and` becomes `&` and `or` becomes `|`. Do not forget the parentheses around each condition!

In [37]:
(bmi>=18) & (bmi<=25)

array([ True, False, False, False,  True,  True])

Do the same for all the BMI values that are *not* healthy (i.e., the values that are below 18 or above 25).

In [38]:
(bmi<18) | (bmi>25)

array([False,  True,  True,  True, False, False])

This tells us which indexes satisfy (or don't) this condition. We can also get the actual values if we wish by doing the following:

In [39]:
bmi[(bmi>=18) & (bmi<=25)]

array([19.72103697, 20.38156971, 20.45288532])

These give us the values of the bmi that are healthy.

Your turn: the list below corresponds to grades obtained in an MBA course. The students that obtained between 80 and 90 will get a B. Create a numpy array. Find which grades satisfy this condition. How many are there?

In [40]:
Grades=[12,54,82,100,95,62,87,34,29,98,72,36,85,69,81,96]
np_Grades=np.array(Grades)
Bs=np_Grades[(np_Grades>=80) & (np_Grades<=90)]
len(Bs)

4

You can easily construct arrays of ones and zeros by using the commands `np.zeros((1,5))` (which would give you an array of size $1\times 5$ of zeros) and likewise `np.ones((1,5))` (which would give you an array of size $1\times 5$ of ones).

In [41]:
print(np.zeros((1,5)))
print(np.ones((1,5)))

[[0. 0. 0. 0. 0.]]
[[1. 1. 1. 1. 1.]]


 You can also construct arrays that contain a range by using `np.arange()`.

In [42]:
np.arange(20)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

You can then reshape this array as needed: say you want an array that has 4 rows and 5 columns, you could simply do the following:

In [43]:
np.arange(20).reshape((4,5))

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

To obtain the shape of the array above, we use `shape`:

In [44]:
np.arange(20).reshape((4,5)).shape

(4, 5)

Your turn! Create a numpy array `A` using `np.arange(12)`. Reshape it so that it is of size (4,3). Then create another numpy array `B` using `np.ones(12)` and reshape it so that it is of size (4,3). Try `A+B`, `A*B`, `A.min()`. What happens?

In [72]:
A=np.arange(12).reshape((4,3))
B=np.ones(12).reshape((4,3))
print(A+B)
print(A*B)
print(A)
print(A.min())

[[ 1.  2.  3.]
 [ 4.  5.  6.]
 [ 7.  8.  9.]
 [10. 11. 12.]]
[[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]
 [ 9. 10. 11.]]
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
0


Other methods such as `.max`, `.cumsum`, and `.sum` work in a similar fashion.
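To make this concrete, here is a minimal sketch on the same array `A`. The `axis=0` argument in the last step goes slightly beyond the text above and is included only as an aside; `col_sums` is an illustrative name.

```python
import numpy as np

A = np.arange(12).reshape((4, 3))

# Largest entry and total of all entries
print(A.max())       # 11
print(A.sum())       # 66

# Running total, computed over the entries read row by row
print(A.cumsum())

# Aside: axis=0 sums down each column instead of over the whole array
col_sums = A.sum(axis=0)
print(col_sums)      # [18 22 26]
```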

Two important methods that we discuss now are `.all` and `.any`. They are also useful for dataframes. Do you understand what they're doing?

In [46]:
(B==1).all()

True

In [47]:
(A==1).all()

False

In [48]:
(A==1).any()

True
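One way to see what is going on is to look at the intermediate array of True/False values before `.all` or `.any` is applied. The sketch below rebuilds `A` and `B` as in the exercise above; the name `mask` is just for illustration.

```python
import numpy as np

A = np.arange(12).reshape((4, 3))
B = np.ones(12).reshape((4, 3))

# The comparison is done entry by entry and gives an array of Booleans
mask = (A == 1)
print(mask)

# .any asks: is at least one entry True?
print(mask.any())        # True

# .all asks: is every entry True?
print(mask.all())        # False
print((B == 1).all())    # True

# Summing a Boolean array counts the True entries
print(mask.sum())        # 1
```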

## 2. Pandas Dataframes

This is an incredibly important type of data structure and one we will use almost exclusively in the machine learning lectures. We will most often read a dataframe from a .csv file (similar to those you can open in Excel). This will require us to use e.g., `Dataset=pd.read_csv("data.csv")`. Here, we don't do this but use a pre-existing dataset from the seaborn plotting package as an example.

In [1]:
import seaborn as sns
tips=sns.load_dataset("tips")
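For reference, here is a rough sketch of what reading a .csv file looks like. To keep it self-contained, the file contents are written out as text and passed through `io.StringIO`; this is an assumption for illustration only, and with a real file you would pass its name (e.g., `"data.csv"`) to `pd.read_csv` instead. The three rows are copied from the tips data.

```python
import io
import pandas as pd

# Stand-in for the contents of a real .csv file (hypothetical example)
csv_text = """total_bill,tip,size
16.99,1.01,2
10.34,1.66,3
21.01,3.50,3
"""

# pd.read_csv turns the comma-separated text into a dataframe;
# the first line is used for the column names
Dataset = pd.read_csv(io.StringIO(csv_text))
print(Dataset)
print(Dataset.shape)     # (3, 3)
```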

The reason we use dataframes a lot is because they are formatted in a functional way (and so easy to read). What happens if you just type in `tips`?

In [2]:
tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


Take a look at the header of tips using `.head()`.

In [51]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


We can also create our own dataframe using a dictionary: we will briefly discuss this when we discuss dictionaries. We will rarely need to do this in our examples.

Dataframes do not treat features (i.e., the columns) and observations (i.e., the rows) in the same way. For example to access column data (e.g., total_bill), it is quite easy. Accessing observations is a bit harder.

Try `tips['total_bill']` and `tips[['total_bill']]`. What is the difference? It may be useful to use `type` to answer the question. We will use `tips[['total_bill']]` in general, as it returns a dataframe, which is the datatype we will be most familiar with.

In [52]:
print(tips['total_bill'])
type(tips['total_bill'])

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64


pandas.core.series.Series

In [53]:
tips[['total_bill']]

Unnamed: 0,total_bill
0,16.99
1,10.34
2,21.01
3,23.68
4,24.59
...,...
239,29.03
240,27.18
241,22.67
242,17.82


In [54]:
type(tips[['total_bill']])

pandas.core.frame.DataFrame

Try selecting two columns now, say `total_bill` and `smoker`.

In [55]:
tips[["total_bill","smoker"]]

Unnamed: 0,total_bill,smoker
0,16.99,No
1,10.34,No
2,21.01,No
3,23.68,No
4,24.59,No
...,...,...
239,29.03,No
240,27.18,Yes
241,22.67,Yes
242,17.82,No


To access observations, we use square brackets with a range of integer positions. For example, `tips[0:5]` accesses the first 5 observations. How would you access observations 5 through 10?

In [56]:
tips[0:5]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [57]:
tips[5:11]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2
7,26.88,3.12,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2
9,14.78,3.23,Male,No,Sun,Dinner,2
10,10.27,1.71,Male,No,Sun,Dinner,2


Sometimes it can be complicated to understand what the *index* of our dataframe is. We use `.index` for this. What do we get in this case? Do you find this surprising?

In [58]:
tips.index

RangeIndex(start=0, stop=244, step=1)

Accessing one observation rather than a range (as done above) can be done using `.loc`: for example `tips.loc[[5]]`. This returns the row whose index label is 5. If you want the row at position 5 (counting from 0), you would use `tips.iloc[[5]]`. Here the two coincide because the index labels are simply 0, 1, 2, and so on. The commands `.loc` and `.iloc` can also be used for the columns. See the examples below.

In [59]:
tips.loc[[5]]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
5,25.29,4.71,Male,No,Sun,Dinner,4


In [60]:
tips.iloc[[5]]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
5,25.29,4.71,Male,No,Sun,Dinner,4


In [61]:
tips.iloc[:,1:3]

Unnamed: 0,tip,sex
0,1.01,Female
1,1.66,Male
2,3.50,Male
3,3.31,Male
4,3.61,Female
...,...,...
239,5.92,Male
240,2.00,Female
241,2.00,Male
242,1.75,Male


In [62]:
tips1=tips.loc[[5,7,8]]

In [63]:
tips1

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
5,25.29,4.71,Male,No,Sun,Dinner,4
7,26.88,3.12,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2


In [64]:
tips1[1:2]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
7,26.88,3.12,Male,No,Sun,Dinner,4
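The difference between `.loc` (labels) and `.iloc` (positions) only becomes visible when the index labels are not simply 0, 1, 2, and so on. Here is a minimal sketch on a small hand-made dataframe; the name `small` and its construction are assumptions for illustration, with values copied from rows 5, 7 and 8 of tips.

```python
import pandas as pd

# Small dataframe whose index labels are 5, 7 and 8
small = pd.DataFrame(
    {"total_bill": [25.29, 26.88, 15.04], "tip": [4.71, 3.12, 1.96]},
    index=[5, 7, 8],
)

# .loc uses the index label: the row labelled 7
by_label = small.loc[[7]]
print(by_label)

# .iloc uses the position: the second row (position 1), which is the same row
by_position = small.iloc[[1]]
print(by_position)

# small.loc[[1]] would raise a KeyError here: no row is labelled 1

# Both also accept a second entry for the columns:
# names for .loc, positions for .iloc
tip_5_and_8 = small.loc[[5, 8], ["tip"]]
print(tip_5_and_8)
first_two_bills = small.iloc[0:2, 0:1]
print(first_two_bills)
```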


Using `.loc` or `.iloc`, you can also select many different observations that are not necessarily contiguous, for example, try selecting rows indexed by 5 and 8 using a similar strategy to arrays.

In [65]:
tips.loc[[5,8]]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
5,25.29,4.71,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2


You can also filter in a similar way as done for arrays: for example, if you only want to consider male tippers you would use:

In [66]:
tips[tips["sex"]=="Male"]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.00,Male,No,Sun,Dinner,2
...,...,...,...,...,...,...,...
236,12.60,1.00,Male,Yes,Sat,Dinner,2
237,32.83,1.17,Male,Yes,Sat,Dinner,2
239,29.03,5.92,Male,No,Sat,Dinner,3
241,22.67,2.00,Male,Yes,Sat,Dinner,2


Your turn! Select the rows that correspond to the size of the party being greater than or equal to 4.

In [67]:
tips[tips["size"]>=4]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
4,24.59,3.61,Female,No,Sun,Dinner,4
5,25.29,4.71,Male,No,Sun,Dinner,4
7,26.88,3.12,Male,No,Sun,Dinner,4
11,35.26,5.0,Female,No,Sun,Dinner,4
13,18.43,3.0,Male,No,Sun,Dinner,4
23,39.42,7.58,Male,No,Sat,Dinner,4
25,17.81,2.34,Male,No,Sat,Dinner,4
31,18.35,2.5,Male,No,Sat,Dinner,4
33,20.69,2.45,Female,No,Sat,Dinner,4
44,30.4,5.6,Male,No,Sun,Dinner,4


It is important to know how to drop a column from a dataframe. This is done using `.drop(columns=["name1","name2"])`, which returns a new dataframe and leaves the original unchanged. Drop the column "time" from the tips dataset.

In [None]:
tips.drop(columns=["time"])

Unnamed: 0,total_bill,tip,sex,smoker,day,size
0,16.99,1.01,Female,No,Sun,2
1,10.34,1.66,Male,No,Sun,3
2,21.01,3.50,Male,No,Sun,3
3,23.68,3.31,Male,No,Sun,2
4,24.59,3.61,Female,No,Sun,4
...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,3
240,27.18,2.00,Female,Yes,Sat,2
241,22.67,2.00,Male,Yes,Sat,2
242,17.82,1.75,Male,No,Sat,2
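One point worth checking for yourself is that `.drop` does not change the dataframe it is applied to. The sketch below uses a tiny hand-made dataframe (`df` and `dropped` are illustrative names, not part of the tips dataset).

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "tip": [1.01, 1.66],
    "time": ["Dinner", "Dinner"],
})

# .drop returns a new dataframe without the column...
dropped = df.drop(columns=["time"])
print(list(dropped.columns))   # ['total_bill', 'tip']

# ...while the original still has it
print(list(df.columns))        # ['total_bill', 'tip', 'time']

# To keep the change, assign the result back to the variable
df = df.drop(columns=["time"])
print(list(df.columns))        # ['total_bill', 'tip']
```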


In [None]:
tips.mean(numeric_only=True)

total_bill    19.785943
tip            2.998279
size           2.569672
dtype: float64

There are many, many different functions that can be applied to the dataset: `.describe()` is probably the most useful. We also have things such as `.max`, `.min`, `.median`, etc. Try them out!

In [None]:
tips.describe()

Unnamed: 0,total_bill,tip,size
count,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672
std,8.902412,1.383638,0.9511
min,3.07,1.0,1.0
25%,13.3475,2.0,2.0
50%,17.795,2.9,2.0
75%,24.1275,3.5625,3.0
max,50.81,10.0,6.0


We briefly discuss `.groupby`. This method enables us to summarize a dataset considerably by grouping together observations that share some characteristic and then applying an operation to each group.

In [74]:
tips.groupby(["sex"]).mean(numeric_only=True)

sex,total_bill,tip,size
Male,20.744076,3.089618,2.630573
Female,18.056897,2.833448,2.45977


For example, we observe here that the total bill is on average higher when a man pays than when a woman does. How does the day of the week impact the total bill?

In [None]:
tips.groupby(["day"]).mean(numeric_only=True)

day,total_bill,tip,size
Thur,17.682742,2.771452,2.451613
Fri,17.151579,2.734737,2.105263
Sat,20.441379,2.993103,2.517241
Sun,21.41,3.255132,2.842105


Use `.groupby` and `.count` to find the most observed set-up of size of group and sex in the dataset.

In [None]:
tips.groupby(["size","sex"]).count()

size,sex,total_bill,tip,smoker,day,time
1,Male,1,1,1,1,1
1,Female,3,3,3,3,3
2,Male,98,98,98,98,98
2,Female,58,58,58,58,58
3,Male,24,24,24,24,24
3,Female,14,14,14,14,14
4,Male,28,28,28,28,28
4,Female,9,9,9,9,9
5,Male,4,4,4,4,4
5,Female,1,1,1,1,1
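Reading the largest count off a table works, but it can also be extracted directly. The sketch below is one possible approach, using `.size()` (one count per group) and `.idxmax()` on a small made-up dataframe; `df` and `counts` are illustrative names and the data are not taken from tips.

```python
import pandas as pd

# Small made-up dataset with the same two grouping columns
df = pd.DataFrame({
    "size": [2, 2, 2, 3, 3, 4],
    "sex": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "tip": [1.0, 2.0, 1.5, 3.0, 2.5, 4.0],
})

# .size() gives a single count per (size, sex) group
counts = df.groupby(["size", "sex"]).size()
print(counts)

# .idxmax() returns the group with the largest count, .max() the count itself
print(counts.idxmax())
print(counts.max())      # 2
```

Unlike `.count()`, which repeats the count for every remaining column, `.size()` returns one number per group.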


## 3. Dictionaries

Dictionaries are a built-in data structure in Python. They have the advantage of being very flexible: each entry stores only the information we actually have. They can be harder to read, however. Consider the following dictionary.

In [None]:
scientists={"marie curie":["radioactivity",2], "albert einstein":["relativity",1], "isaac newton":["gravity"]}

In [None]:
scientists

{'marie curie': ['radioactivity', 2],
 'albert einstein': ['relativity', 1],
 'isaac newton': ['gravity']}

The numbers here are the number of Nobel Prizes (only a handful of people have ever received two). Note that as Isaac Newton lived before Nobel Prizes existed, he does not have any. This isn't a problem in dictionaries: we can store as much (or as little) information for each entry as we like (i.e., there can be a lot of variability between entries). In a dataframe, by contrast, this would have to be an empty entry, so we would waste room storing a useless value.

We will see that this is very useful for optimization. Try accessing the "marie curie" entry in the dictionary. How would you proceed? Note that "marie curie", "albert einstein", "isaac newton" are what are called *keys*. We can list them by typing `scientists.keys()`.

In [None]:
scientists["marie curie"]

['radioactivity', 2]

In [None]:
scientists.keys()

dict_keys(['marie curie', 'albert einstein', 'isaac newton'])
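A few more everyday dictionary operations, as a minimal sketch on the same `scientists` dictionary. The extra "ada lovelace" entry is made up purely to illustrate adding a key.

```python
scientists = {
    "marie curie": ["radioactivity", 2],
    "albert einstein": ["relativity", 1],
    "isaac newton": ["gravity"],
}

# Add a new entry by assigning to a new key (illustrative entry)
scientists["ada lovelace"] = ["computing"]

# The `in` condition checks whether a key is present
print("isaac newton" in scientists)      # True

# .items() lets us loop over keys and values together
for name, info in scientists.items():
    print(name, "->", info[0])

# Entries do not need to have the same length
print(len(scientists["marie curie"]))    # 2
print(len(scientists["isaac newton"]))   # 1
```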

We can also use dictionaries to construct dataframes. We first construct a dictionary: `d={'col1':[information], 'col2':[information]}`. We then use `pd.DataFrame(d)` to obtain the corresponding dataframe. (We avoid naming the variable `dict`, as this would overwrite Python's built-in `dict`.)

Construct a dictionary with columns "capital" and "continent" included. Populate these with the information on your favorite countries. Make this into a dataframe with the index being the countries' names.

In [None]:
countries={'capital':["Paris", "Ottawa","Accra", "Bogota","Ulaanbaatar"], 'continent':["Europe","North America", "Africa","South America","Asia"]}
# The index must be a list (square brackets): a set (curly braces) has no guaranteed order,
# so the country names could end up next to the wrong capitals
countries_pd=pd.DataFrame(countries, index=["France","Canada","Ghana","Colombia","Mongolia"])
countries_pd

Unnamed: 0,capital,continent
France,Paris,Europe
Canada,Ottawa,North America
Ghana,Accra,Africa
Colombia,Bogota,South America
Mongolia,Ulaanbaatar,Asia


# 5. Exercises *(Homework)*

## Exercise 1: 

Consider this list `L= [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]`. Write a routine using for loops and if conditions that takes this list and makes a new list that only has even elements of this list in it.

In [75]:
L=[1,4,9,16,25,36,49,64,81,100]
M=[]
for i in L:
  if i%2==0:
    M.append(i)

M

[4, 16, 36, 64, 100]

## Exercise 2

Write a routine that takes this array `arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])` as input and modifies it so that all odd numbers have been replaced with -1. This can be done in a line using subsetting.

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[arr%2==1]=-1
arr

array([ 0, -1,  2, -1,  4, -1,  6, -1,  8, -1])

## Exercise 3

1. Using the `tips` dataframe from above, find the largest total bill.
2. Isolate the row/observation which this corresponds to. Which day was this? At what time?
3. For each day of the week, find the average total bill.

In [76]:
import seaborn as sns
tips=sns.load_dataset("tips")

In [77]:
tips[["total_bill"]].max()

total_bill    50.81
dtype: float64

In [80]:
tips[tips["total_bill"]==tips["total_bill"].max()]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
170,50.81,10.0,Male,Yes,Sat,Dinner,3


In [None]:
tips.groupby("day")[["total_bill"]].mean()

day,total_bill
Thur,17.682742
Fri,17.151579
Sat,20.441379
Sun,21.41
