<a href="https://colab.research.google.com/github/lucilesepa/training_python/blob/main/home_ex1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**The home exercises will be a bit different: the theory will be given in one single block, and the exercises will mix up all the known concepts.**

# Basic variable type: boolean

Booleans can take the value True or False. Strictly speaking, boolean is a data type rather than a variable: it is the type of the result of a condition. You have already used booleans implicitly.

You see in this example that the condition a>b is, like all the conditions, a boolean:


In [None]:
a=13
b=1
print(type(a>b))
print(a>b)

<class 'bool'>
True


There are many ways to use booleans, but the most common one in data analysis is in an `if` statement, to test whether a condition is fulfilled.


In [None]:
if a>b:
  print('the boolean a>b is true')


the boolean a>b is true
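Conditions can also be combined with `and`, `or` and `not`, and the resulting boolean can be stored in a variable like any other value. The following is a minimal sketch reusing the values of a and b from above; the names `a_is_bigger`, `both`, `either` and `neither` are only chosen for illustration.

```python
a = 13
b = 1

# a comparison returns a boolean, which can be stored in a variable
a_is_bigger = a > b

# booleans can be combined with and / or / not
both = (a > b) and (b > 0)     # True only if both conditions are true
either = (a < b) or (b == 1)   # True if at least one condition is true
neither = not a_is_bigger      # inverts the boolean

if both:
    print('a is bigger than b, and b is positive')
else:
    print('at least one of the two conditions is false')
```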


# Tuples and dictionaries



## Tuples

Tuples are one of the 4 built-in collection data types in Python: sets (which we will not cover, as they are very rarely used in data analysis / GIS), dictionaries, tuples and lists.

*Note:* 
*   Pandas dataframes are not built-in: they require the use of the package Pandas

In practice, tuples are rarely created by hand in data analysis / GIS, but many Python functions return tuples.

e.g:

In [None]:
import numpy as np
a=[[1,2,3],[9,1,4]]
b=np.array(a)
s = b.shape
print('the output from .shape is a ',type(s))

the output from .shape is a  <class 'tuple'>


They are indexed with the regular use of []

e.g. to use the number of rows of an array (the first value of the shape tuple; the second value, `s[1]`, is the number of columns):

In [None]:
print('there are ',s[0],' rows in b')

there are  2  rows in b
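Because a tuple has a fixed length, its values can also be unpacked into separate variables in one line. Tuples are also immutable: trying to change one of their values raises a `TypeError`. A small sketch, using the same array as above (the names `nb_rows` and `nb_cols` are only illustrative):

```python
import numpy as np

b = np.array([[1, 2, 3], [9, 1, 4]])

# unpack the shape tuple: the first value is the number of rows,
# the second value is the number of columns
nb_rows, nb_cols = b.shape
print('b has', nb_rows, 'rows and', nb_cols, 'columns')

# a tuple cannot be modified once it has been created
s = b.shape
try:
    s[0] = 5
except TypeError:
    print('a tuple cannot be modified')
```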


## Dictionaries

They are used to store paired values (a key and its value).

e.g:

In [None]:
city = {
  "name": "Glasgow",
  "population": 635640,
  "N": 55.85781,
  "E": -4.24253
}
print(city)

{'name': 'Glasgow', 'population': 635640, 'N': 55.85781, 'E': -4.24253}


A dictionary cannot be indexed by position. Instead, its values are accessed by key:

In [None]:
city['name']

'Glasgow'

In my experience they are rarely used.
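For completeness, here is a short sketch of the other operations you are most likely to meet: adding a key, reading a key that may not exist, and looping over the pairs. The variable name `city` and the key `"area"` are only chosen for illustration.

```python
city = {"name": "Glasgow", "population": 635640}

# adding a new key uses the same syntax as reading one
city["country"] = "Scotland"

# .get() returns None (or a default value of your choice)
# instead of raising a KeyError when the key does not exist
area = city.get("area")
area_default = city.get("area", float("nan"))

# loop over the key / value pairs
for key, value in city.items():
    print(key, ':', value)
```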

# More conversions

During the session, we have seen the conversions between numpy arrays, pandas dataframes, and lists. Similarly, you can convert floats, integers, and strings:

In [None]:
a = 2
b=6.3
c='4'
print('a=',a)
print('b=',b)
print('c=',c)
print('str(a)=',str(a))
print('int(b)=',int(b))
print('float(c)=',float(c))

a= 2
b= 6.3
c= 4
str(a)= 2
int(b)= 6
float(c)= 4.0


*Note*: 
*   You can see that converting a float to an integer truncates the decimals (it does not round)
*   Converting a string that does not represent a number, e.g. `m='some string'`, to a float raises a `ValueError`
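The difference between truncating and rounding is easy to overlook, so here is a small sketch. It also shows that a string holding a decimal number has to go through `float` before it can become an integer. The values are arbitrary.

```python
# int() truncates towards zero, round() rounds to the nearest integer
truncated_pos = int(6.9)
truncated_neg = int(-6.9)
rounded = round(6.9)
print(truncated_pos, truncated_neg, rounded)   # 6 -6 7

# int('4.7') raises a ValueError, because '4.7' is not an integer string:
# convert to float first, then to int
c = '4.7'
c_int = int(float(c))
print(c_int)   # 4
```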



# Vectorization

Vectorization is an important concept: it means replacing a for loop with an operation applied to the whole vector at once, as opposed to applying it to each element of the vector in turn.

e.g. a non vectorized addition of 3 to the elements of a vector a and storage of the result in b:


In [None]:
a=[1,4,6]
b=[]
for i in a:
  b.append(i+3)
print(b)

[4, 7, 9]


Vectorized expression:

In [None]:
import numpy as np
c=np.array(a)+3
print(c)
type(c)

[4 7 9]


numpy.ndarray

*Note: b and c are not of the same type. Conversion would be needed to make b and c equal*

For loops are slow in Python, especially on large datasets. Vectorization should be preferred wherever possible.
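The sketch below repeats the two approaches and shows the conversion mentioned in the note, so that the results can be compared. It also illustrates why a numpy array is needed: arithmetic operators on a plain list do not act element by element.

```python
import numpy as np

a = [1, 4, 6]

# non-vectorized: loop over the elements
b = []
for i in a:
    b.append(i + 3)

# vectorized: one operation on the whole array
c = np.array(a) + 3

# convert the array to a list to compare like with like
print(c.tolist() == b)   # True

# on a plain list, * repeats the list instead of multiplying each element
repeated = a * 2
doubled = np.array(a) * 2
print(repeated)   # [1, 4, 6, 1, 4, 6]
print(doubled)    # [ 2  8 12]
```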

# NaN values

NaN in programming stands for Not a Number. The computer treats it as a float that does not have a value. NaN propagates through arithmetic operations: any calculation that involves a NaN returns NaN, as the example below shows.
Their use is very common in data analysis, typically to mark missing values.

In Python there are several ways to set a value to NaN. A common way is:
```
float('nan')
```

e.g.

In [116]:
import numpy as np
g=np.array([1,5,float('nan')])
h=np.array([-4,3,7])
k=g-h
print(k)

[ 5.  2. nan]
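A NaN cannot be found with `==`, because a NaN is not equal to anything, not even to itself. The sketch below shows how to detect NaN values with `np.isnan`, and how the `nan`-prefixed numpy functions skip them explicitly when computing a sum or a mean.

```python
import numpy as np

g = np.array([1, 5, float('nan')])

# a NaN is never equal to another NaN
nan_equal = float('nan') == float('nan')
print(nan_equal)   # False

# np.isnan returns a boolean array, True where the value is NaN
is_nan = np.isnan(g)
print(is_nan)

# a regular sum is contaminated by the NaN...
print(np.sum(g))       # nan
# ...while nansum and nanmean leave the NaN out of the calculation
print(np.nansum(g))    # 6.0
print(np.nanmean(g))   # 3.0
```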


# try, except, else statement 

Some commands raise an error which stops the execution of the code. These errors are called exceptions.
You can state that you only want to "try" the command:
```
try:
    <do something>
except Exception:
    <handle the error>
else:
    <if no exception is raised>
```
The `else` block is optional.

You will see an example of its use in the section "Find an element / the index of an element in... a list".
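As a first concrete illustration of the structure above, this sketch converts a list of strings to floats and stores a NaN whenever the conversion fails. The list `values` is made up for the example.

```python
values = ['3.5', 'abc', '10']
converted = []

for v in values:
    try:
        x = float(v)                      # may raise a ValueError
    except ValueError:
        converted.append(float('nan'))    # handle the error
    else:
        converted.append(x)               # runs only if no exception was raised

print(converted)   # [3.5, nan, 10.0]
```

Catching the specific exception (`ValueError` here) rather than the generic `Exception` avoids hiding unrelated errors.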

# A few more commands

## Add elements to the end of...

### ... a list

Typically used in for loops to populate a dataset. The formal structure is:
```
dataset.append(new_value)
```
e.g. stores in p the results of the addition of m and n


In [58]:
m = [1,5,7,8]
n=[4,0,-3,5]
p=[]
for i in range(len(m)):
  p.append(m[i]+n[i])
print(p)
# p is reset to [] at the top of this cell, so re-running the cell does not keep adding values to the end of p

[5, 5, 4, 13]


*Note*: the previous code is only to demonstrate how .append() works. In real-world coding, this particular code should be vectorized, e.g.


In [59]:
import numpy as np
p_arr = np.array(m)+np.array(n) 
print(p_arr)
type(p_arr)

[ 5  5  4 13]


numpy.ndarray

### ... a numpy array

This works like .append() for lists, except that the axis needs to be specified, otherwise 2D arrays will be flattened. As for pandas dataframes:

*   The command `np.append(arr1, arr2, axis=...)` does not modify arr1: it concatenates arr1 and arr2 and returns a new array, which you need to store in a variable.
*   When an axis is specified, arr2 must have the same number of dimensions as arr1, and the same shape along the other axes.

In [50]:
import numpy as np
ar1 = np.array([[1, 2],[3, 4]])
ar2 = np.array([[5, 6],[7, 8]])
ar1_a = np.append(ar1, ar2)
ar1_b = np.append(ar1, ar2, axis=0)
ar1_c = np.append(ar1, ar2, axis=1)
print(ar1_a)
print(ar1_b)
print(ar1_c)

[1 2 3 4 5 6 7 8]
[[1 2]
 [3 4]
 [5 6]
 [7 8]]
[[1 2 5 6]
 [3 4 7 8]]
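The sketch below checks two points on the same arrays: the original array is left untouched by `np.append`, and `np.concatenate` gives the same result when it is passed the arrays as a tuple. The names `stacked` and `same` are only illustrative.

```python
import numpy as np

ar1 = np.array([[1, 2], [3, 4]])
ar2 = np.array([[5, 6], [7, 8]])

# np.append returns a new array: ar1 itself is not changed
stacked = np.append(ar1, ar2, axis=0)
print(ar1.shape)       # (2, 2)
print(stacked.shape)   # (4, 2)

# np.concatenate takes the arrays to join as a tuple (or list)
same = np.concatenate((ar1, ar2), axis=0)
print(np.array_equal(stacked, same))   # True
```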


### ... a pandas dataframe

It broadly works the same way for dataframes as for lists. However, because Pandas has been developed more or less separately from the core of Python, the syntax is not exactly the same:


*   Dataframes are joined with `pd.concat([df, df2])`. The older `df.append(df2)` method was deprecated in pandas 1.4 and removed in pandas 2.0.
*   `pd.concat` does not modify df: you need to store the result in a variable.
*   df2 has to be a dataframe or a series (it cannot be a single value, as it can for lists)

Also, the row indexes will stay the same as in the 2 original dataframes df and df2, unless `ignore_index=True` is specified.

e.g. adds a dataframe df_n to a dataframe df_m

In [28]:
import pandas as pd
df_m=pd.DataFrame(m)
df_n=pd.DataFrame(n)
df_m = pd.concat([df_m, df_n], ignore_index=True)
print(df_m)

   0
0  1
1  5
2  7
3  8
4  4
5  0
6 -3
7  5


**Important note:** In theory, appending is adding values to an existing dataset, while concatenating is joining two datasets.
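To make the effect of `ignore_index` visible, the sketch below joins the same two dataframes with and without it. It is written with `pd.concat`, which is available in both older and recent versions of pandas; the names `kept` and `renumbered` are only illustrative.

```python
import pandas as pd

df_m = pd.DataFrame([1, 5, 7, 8])
df_n = pd.DataFrame([4, 0, -3, 5])

# without ignore_index, each row keeps the index it had in its original dataframe
kept = pd.concat([df_m, df_n])
print(list(kept.index))         # [0, 1, 2, 3, 0, 1, 2, 3]

# with ignore_index=True, the rows are renumbered from 0
renumbered = pd.concat([df_m, df_n], ignore_index=True)
print(list(renumbered.index))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Duplicate index values are legal in pandas, but they make row selection by index ambiguous, so renumbering is usually the safer choice.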

## Find an element / the index of an element in...


### ... a list

Finding the index of a value value_to_find in a list is easy. The general code is:
```
mylist.index(value_to_find)
```

*Note*: the code raises a `ValueError` if value_to_find isn't in mylist. If the value appears several times, only the index of the first occurrence is returned.

e.g. find the index of the values 8 and 5 in a list. If the value doesn't exist, fill the variable with NaN



In [64]:
a=[87,8,12,9,10]

try:
  a.index(8)
except ValueError:
  b=float("nan")
else:
  b=a.index(8)

try:
  a.index(5)
except ValueError:
  c=float("nan")
else:
  c=a.index(5)

print(b)
print(c)

1
nan
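An alternative to try / except is to test first whether the value is present, using the `in` operator, which returns a boolean. This is only a sketch of another possible approach on the same list; both give the same result.

```python
a = [87, 8, 12, 9, 10]

# 'value in list' is a condition, so it returns True or False
if 8 in a:
    b = a.index(8)
else:
    b = float("nan")

if 5 in a:
    c = a.index(5)
else:
    c = float("nan")

print(b)   # 1
print(c)   # nan
```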


### ... a numpy array

Finding an element in a numpy array uses the function `np.where`: 
```
np.where(condition)
```
with `condition` being a comparison applied to the array, e.g. `a > 20`.
However, as for a pandas dataframe, another step is needed before you can easily access the indexes.

e.g.

In [121]:

# create numpy array elements
a = np.array([2, 3, 4, 5, 6, 45, 67, 34])
x=np.where(a > 20)[0] # the result of np.where() is a tuple; for a 1D array, its element 0 is the array of indexes
print(x)
type(x)

[5 6 7]


numpy.ndarray
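If you need the values rather than their indexes, the condition itself can be used to select them: it produces a boolean array (often called a mask). The sketch below also shows why `np.where` returns a tuple: for a 2D array it holds one array of indexes per dimension. The array `grid` is made up for the example.

```python
import numpy as np

a = np.array([2, 3, 4, 5, 6, 45, 67, 34])

# the condition is a boolean array with one value per element of a
mask = a > 20
# indexing with the mask keeps only the elements where it is True
print(a[mask])   # [45 67 34]

# for a 2D array, np.where returns one array of indexes per dimension
grid = np.array([[1, 50], [60, 2]])
rows, cols = np.where(grid > 20)
print(rows)   # [0 1]
print(cols)   # [1 0]
```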

### ... a Pandas dataframe

Finding elements in a pandas dataframe can be done with the `query` method: 

```
df.query(expression)
```
with `expression` being a string very similar in syntax to the WHERE clause of a SQL query.
However, you will then need to use the `.index` attribute to get the indices, and convert it to a list (or any other type of dataset that you want).
e.g.

In [118]:
import pandas as pd
df=pd.DataFrame({'letters':['a','b','c'],'numbers1':[10,20,30],'numbers2':[0,30,10]})
print(df)
df_querya=df.query('letters == "a"')
df_query30=df.query('numbers1 == 30 or numbers2 == 30')
df_query50=df.query('numbers1 == 50')
print(type(df_query30))
list(df_query30.index) # note the conversion of the index of the query result (a pandas Index object) to an easy-to-handle list

  letters  numbers1  numbers2
0       a        10         0
1       b        20        30
2       c        30        10
<class 'pandas.core.frame.DataFrame'>


[1, 2]
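The same selection can also be written with a boolean mask, as for numpy arrays. This is a sketch of an equivalent approach, not a replacement for `query`; note that the conditions are combined with `|` (or) and `&` (and), each wrapped in parentheses. It also shows what a query with no match returns.

```python
import pandas as pd

df = pd.DataFrame({'letters': ['a', 'b', 'c'],
                   'numbers1': [10, 20, 30],
                   'numbers2': [0, 30, 10]})

# boolean mask equivalent to the query 'numbers1 == 30 or numbers2 == 30'
mask = (df['numbers1'] == 30) | (df['numbers2'] == 30)
df_mask30 = df[mask]
print(list(df_mask30.index))    # [1, 2]

# a query with no match returns an empty dataframe, not an error
df_query50 = df.query('numbers1 == 50')
print(df_query50.empty)         # True
print(list(df_query50.index))   # []
```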

# Final remarks



*   As you can see, the handling of lists, numpy arrays, and pandas dataframes is not consistent in Python: they each have their specific commands. Some commands may exist for more than 1 of those types of dataset, but their output can be very different. 
*   Juggling between several types of datasets is core in Python: you will get used to them with time.
*   Don't hesitate to go online to find answers.


# Exercises

*   The exercises are more or less in increasing order of difficulty.
*   There are many ways to solve each exercise. Try to be as efficient as possible. The answer provided is only there as an example. Don't hesitate to ask if you are not sure about your answer, or if you are stuck



## Exercise 1
You are given this dataset:
```
cool_data = [['Perth',-4.434661,56.395336], ['Edinburgh', -4.20277, 55.95415], ['Aberdeen', -184.10272, 58.14548],['Inverness',-4.224721,58.477772],['Glasgow',-999,55.85781]]
```
Unfortunately there have been some problems with the projections: -1 degree has been applied to the latitudes, and +1 has been applied to the longitudes above 57 degrees.

Fix this dataset (but keep the original), and get rid of any strange value.



## Exercise 2

*This exercise is a classic! Again, there are many ways to solve it, and you might even come up with a better solution than me!*

Create a nice Christmas tree consisting of 10 rows of the character `*` and 2 rows of the character `|`, so that it looks like the output of the example answer below.

Tip: a pen and a paper might help you to tackle this



In [132]:
nb_rows_tree = 10
nb_rows_trunk = 2

n=nb_rows_tree # number of leading spaces for the current row
for i in range(nb_rows_tree):
  print(' '*(n),'*'*(2*i+1)) # each row is 2 characters wider than the previous one
  n=n-1
for j in range(nb_rows_trunk):
  print(' '*(nb_rows_tree-1),'|'*3) # the trunk is centred under the top of the tree

           *
          ***
         *****
        *******
       *********
      ***********
     *************
    ***************
   *****************
  *******************
          |||
          |||
