# 1. Creating Numpy arrays

Python has many different types of data "containers": lists, dictionaries, tuples etc. However none of them allows for efficient numerical calculation, in particular not in multi-dimensional cases (think e.g. of operations on images). Numpy has been developed exactly to fill this gap. It provides a new data structure, the **numpy array**, and a large library of operations that allow you to:
- generate such arrays
- combine arrays in different ways (concatenation, stacking etc.)
- modify such arrays (projection, extraction of sub-arrays etc.)
- apply mathematical operations on them

Numpy is the base of almost the entire Python scientific programming stack. Many libraries build on top of Numpy, either by providing specialized functions to operate on arrays (e.g. scikit-image for image processing) or by creating more complex data containers on top of them. The data science library Pandas that will also be presented in this course is a good example of this with its dataframe structures.


In [1]:
import numpy as np

## 1.1 What is an array?

Let us create the simplest example of an array by transforming a regular Python list into an array (we will see more advanced ways of creating arrays in the next chapters):

In [2]:
mylist = [2,5,3,9,5,2]

In [3]:
mylist

[2, 5, 3, 9, 5, 2]

In [4]:
myarray = np.array(mylist)

In [5]:
myarray

array([2, 5, 3, 9, 5, 2])

In [6]:
type(myarray)

numpy.ndarray

We see that ```myarray``` is a Numpy array thanks to the ```array``` specification in the output. The type also says that we have a numpy ndarray (n-dimensional). At this point we don't see a big difference with regular lists, but we'll see in the following sections all the operations we can do with these objects.

We can already see a difference with two basic attributes of arrays: their type and shape.

### 1.1.1 Array Type

Just like when we create regular variables in Python, arrays receive a type when created. Unlike regular lists, **all** elements of an array always have the same type. The type of an array can be recovered through the ```.dtype``` attribute:

In [7]:
myarray.dtype

dtype('int32')

Depending on the content of the list, the array will have different types. The logic of "maximal complexity" is kept: Numpy picks the most general type needed to represent every element. For example if we mix integers and floats, we get a float array:

In [8]:
myarray2 = np.array([1.2, 6, 7.6, 5])
myarray2

array([1.2, 6. , 7.6, 5. ])

In [9]:
myarray2.dtype

dtype('float64')

In general, we can assign a type to an array explicitly. This is true here, as well as later when we create more complex arrays, and is done via the ```dtype``` option of the creation function.
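
As a minimal sketch of the ```dtype``` option, the example below imposes a float type on a list of integers. The variable names are chosen for illustration only, and the conversion with ```astype``` is an addition that is not covered in the text above:

```python
import numpy as np

# Same kind of list as above, but with the type imposed at creation time
float_array = np.array([2, 5, 3, 9, 5, 2], dtype=np.float64)
print(float_array)        # the integers are stored as floats
print(float_array.dtype)  # float64

# An existing array can be converted with astype, which returns a new array
int_array = np.array([1.2, 6, 7.6, 5]).astype(np.int64)
print(int_array)          # the fractional parts are truncated
```

Note that converting floats to integers drops the fractional part rather than rounding it.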

### 1.1.2 Array shape

A very important property of an array is its **shape**, in other words the length of each axis. It can be accessed via the ```.shape``` attribute:

In [10]:
myarray

array([2, 5, 3, 9, 5, 2])

In [11]:
myarray.shape

(6,)

We see that our simple array has only one dimension of length 6. Now of course we can create more complex arrays. Let's create for example a *list of two lists*:

In [12]:
my2d_list = [[1,2,3], [4,5,6]]

my2d_array = np.array(my2d_list)
my2d_array

array([[1, 2, 3],
       [4, 5, 6]])

In [13]:
my2d_array.shape

(2, 3)

In [14]:
my3d_list = [[[1,2,3],[12,22,32]], [[4,5,6],[42,52,62]]]

my3d_array = np.array(my3d_list)
my3d_array

array([[[ 1,  2,  3],
        [12, 22, 32]],

       [[ 4,  5,  6],
        [42, 52, 62]]])

In [15]:
my3d_array.shape

(2, 2, 3)

We see that the shape of ```my2d_array``` is *two-dimensional*: we have 2 lists of 3 elements. In fact at this point we should forget that we have a list of lists and simply consider this object as a *matrix* with *two rows and three columns*. In the same way, the last array is *three-dimensional* with shape ```(2, 2, 3)```: two blocks, each with two rows and three columns.
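
The relation between shape, number of dimensions and number of elements can be checked with a short sketch. The names ```matrix``` and ```cube``` are illustrative, and the ```.ndim``` and ```.size``` attributes are additions to what is shown above:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])
cube = np.array([[[1, 2, 3], [12, 22, 32]], [[4, 5, 6], [42, 52, 62]]])

# shape lists the length of each axis, ndim is the number of axes
print(matrix.shape, matrix.ndim)
print(cube.shape, cube.ndim)

# size is the total number of elements, i.e. the product of the shape
print(matrix.size, cube.size)
```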

## 1.2 Creating arrays

We have seen that we can turn regular lists into arrays. However this quickly becomes impractical for larger arrays. Numpy offers several functions to create particular arrays.

### 1.2.1 Common simple arrays
For example an array full of zeros or ones:

In [16]:
one_array = np.ones((2,3))
one_array

array([[1., 1., 1.],
       [1., 1., 1.]])

In [17]:
zero_array = np.zeros((2,3))
zero_array

array([[0., 0., 0.],
       [0., 0., 0.]])

In [18]:
one_array1 = np.ones((2,2,3))
one_array1

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

One can also create an identity matrix (ones on the diagonal, zeros elsewhere):

In [19]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

By default Numpy creates float arrays:

In [20]:
one_array.dtype

dtype('float64')

However as mentioned before, one can impose a type using the ```dtype``` option:

In [21]:
one_array_int = np.ones((2,3), dtype=np.int8)
one_array_int

array([[1, 1, 1],
       [1, 1, 1]], dtype=int8)

In [22]:
one_array_int.dtype

dtype('int8')

### 1.2.2 Copying the shape
Often one needs to create an array with the same shape as an existing one. This can be done with "like-functions":

In [23]:
one_array

array([[1., 1., 1.],
       [1., 1., 1.]])

In [24]:
same_shape_array = np.zeros_like(one_array)
same_shape_array

array([[0., 0., 0.],
       [0., 0., 0.]])

In [25]:
one_array.shape

(2, 3)

In [26]:
same_shape_array.shape

(2, 3)

In [27]:
np.ones_like(one_array)

array([[1., 1., 1.],
       [1., 1., 1.]])
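
The like-functions copy more than the shape: by default they also reuse the type of the template array. The sketch below illustrates this with an illustrative ```template``` array and shows that the ```dtype``` option can still override it:

```python
import numpy as np

template = np.ones((2, 3), dtype=np.int8)

# Shape and type are both taken from the template
zeros_same = np.zeros_like(template)
print(zeros_same.shape, zeros_same.dtype)

# The type can be overridden while keeping the shape
zeros_float = np.zeros_like(template, dtype=np.float64)
print(zeros_float.shape, zeros_float.dtype)
```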

# 2. Creating Pandas dataframes

Python has a series of data containers (lists, dicts etc.) and Numpy offers multi-dimensional arrays. However, none of these structures offers a simple way to handle tabular data or to easily do standard database operations. This is why Pandas exists: it offers a complete ecosystem of structures and functions dedicated to handling large tables with inhomogeneous contents.

In this chapter, we are going to learn about the two main structures of Pandas: Series and Dataframes.

## 2.1 Series

### 2.1.1 Simple series

Series are the Pandas version of 1-D Numpy arrays. We are rarely going to use them directly, but they often appear implicitly when handling data from the more general Dataframe structure. We therefore only cover the basics here.

To understand Series' specificities, let's create one. Usually Pandas structures (Series and Dataframes) are created from other simpler structures like Numpy arrays or dictionaries:

In [28]:
numpy_array = np.array([4,8,38,1,6])
numpy_array

array([ 4,  8, 38,  1,  6])

The function ```pd.Series()``` allows us to convert objects into Series:

In [29]:
import pandas as pd
pd_series = pd.Series(numpy_array)
pd_series

0     4
1     8
2    38
3     1
4     6
dtype: int32

The underlying structure can be recovered with the ```.values``` attribute: 

In [30]:
pd_series.values

array([ 4,  8, 38,  1,  6])
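
As a small sketch of what a Series holds, the example below separates the two parts of a Series: the underlying array and the index. The ```.to_numpy()``` method is an addition to the text above; it returns the same kind of Numpy array as ```.values```:

```python
import numpy as np
import pandas as pd

series = pd.Series(np.array([4, 8, 38, 1, 6]))

# The data part of the Series as a plain Numpy array
as_array = series.to_numpy()
print(type(as_array))

# The index part: without custom labels it simply counts from 0
print(series.index)
```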

Otherwise, indexing works as for regular arrays:

In [31]:
pd_series[0]

4

### 2.1.2 Indexing

On top of accessing values in a series by regular indexing, one can create custom indices for each element in the series:

In [32]:
pd_series2 = pd.Series(numpy_array, index=['a', 'b', 'c', 'd','e'])

In [33]:
pd_series2

a     4
b     8
c    38
d     1
e     6
dtype: int32

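Now a given element can be accessed either by its position (through ```.iloc[]```, since recent Pandas versions warn when a plain integer is used as a position on a Series with custom labels):

In [34]:
pd_series2.iloc[1]

8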

or its chosen index:

In [35]:
pd_series2['b']

8

A more direct way to create specific indexes is to transform a dictionary into a Series:

In [36]:
composer_birth = {'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858, 'Shostakovich': 1906}

In [37]:
pd_composer_birth = pd.Series(composer_birth)
pd_composer_birth

Mahler          1860
Beethoven       1770
Puccini         1858
Shostakovich    1906
dtype: int64

In [38]:
pd_composer_birth['Puccini']

1858

## 2.2 Dataframes

In most cases, one has to deal with more than just one variable, e.g. one has the birth year and the death year of a list of composers. Also one might have different types of information, e.g. in addition to numerical variables (year) one might have string variables like the city of birth. The Pandas structure that allows one to deal with such complex data is called a Dataframe, which can be seen as an aggregation of Series with a common index.

### 2.2.1 Creating a Dataframe

To see how to construct such a Dataframe, let's create some more information about composers:

In [39]:
composer_death = pd.Series({'Mahler': 1911, 'Beethoven': 1827, 'Puccini': 1924, 'Shostakovich': 1975})
composer_city_birth = pd.Series({'Mahler': 'Kaliste', 'Beethoven': 'Bonn', 'Puccini': 'Lucques', 'Shostakovich': 'Saint-Petersburg'})

Now we can combine multiple series into a Dataframe by specifying a column name for each series. Note that the series are matched on their indices (here the composers' names), so all of them should share the same index:

In [40]:
composer_birth = {'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858, 'Shostakovich': 1906}
pd_composer_birth = pd.Series(composer_birth)
pd_composer_birth

Mahler          1860
Beethoven       1770
Puccini         1858
Shostakovich    1906
dtype: int64

In [41]:
composers_df = pd.DataFrame({'birth': pd_composer_birth, 'death': composer_death, 'city': composer_city_birth})
composers_df

| | birth | death | city |
|---|---|---|---|
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |
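
To illustrate why the indices matter, here is a hedged sketch with two deliberately incomplete series (the data subsets are chosen for illustration). Pandas matches the rows by label, not by position, and fills the entries that are missing in one of the series with ```NaN```:

```python
import pandas as pd

# Two series that share only part of their index, in a different order
birth = pd.Series({'Mahler': 1860, 'Beethoven': 1770, 'Puccini': 1858})
death = pd.Series({'Puccini': 1924, 'Mahler': 1911, 'Shostakovich': 1975})

df = pd.DataFrame({'birth': birth, 'death': death})

# Rows are aligned on the composer names; missing entries become NaN
print(df)
```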


A more common way of creating a Dataframe is to construct it directly from a dictionary of lists where each element of the dictionary turns into a column:

In [42]:
dict_of_list = {'birth': [1860, 1770, 1858, 1906], 'death':[1911, 1827, 1924, 1975], 
 'city':['Kaliste', 'Bonn', 'Lucques', 'Saint-Petersburg']}

In [43]:
pd.DataFrame(dict_of_list)

| | birth | death | city |
|---|---|---|---|
| 0 | 1860 | 1911 | Kaliste |
| 1 | 1770 | 1827 | Bonn |
| 2 | 1858 | 1924 | Lucques |
| 3 | 1906 | 1975 | Saint-Petersburg |


However, we have now lost the composers' names. We can restore them by providing, as we did before for the Series, a list of indices:

In [44]:
pd.DataFrame(dict_of_list, index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'])

| | birth | death | city |
|---|---|---|---|
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |


### 2.2.2 Accessing values

There are multiple ways of accessing values or series of values in a Dataframe. Unlike in Series, a simple bracket gives access to a column and not to a row, for example:

In [45]:
composers_df['city']

Mahler                   Kaliste
Beethoven                   Bonn
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

returns a Series. Alternatively one can also use the *attribute* syntax and access columns by using:

In [46]:
composers_df.city

Mahler                   Kaliste
Beethoven                   Bonn
Puccini                  Lucques
Shostakovich    Saint-Petersburg
Name: city, dtype: object

The attribute syntax has some limitations, so in case something does not work as expected, revert to the bracket notation.
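
Two typical limitations of the attribute syntax can be sketched as follows. The small Dataframe and its column names are made up for illustration: one name contains a space, the other clashes with an existing Dataframe method:

```python
import pandas as pd

df = pd.DataFrame({'birth year': [1860, 1770], 'count': [10, 9]},
                  index=['Mahler', 'Beethoven'])

# A column name containing a space is not a valid attribute name,
# so only the bracket notation can reach it
print(df['birth year'])

# 'count' clashes with the DataFrame.count method:
# the attribute gives the method, the brackets give the column
print(callable(df.count))
print(df['count'].tolist())
```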

When specifying multiple columns, a DataFrame is returned:

In [47]:
composers_df[['birth','city']]

| | birth | city |
|---|---|---|
| Mahler | 1860 | Kaliste |
| Beethoven | 1770 | Bonn |
| Puccini | 1858 | Lucques |
| Shostakovich | 1906 | Saint-Petersburg |


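One of the important differences with a regular Numpy array is that here, regular indexing doesn't work: a single bracket is interpreted as a column name, so an expression such as ```composers_df[0]``` raises a ```KeyError``` because there is no column called ```0```.

Instead one has to use either the ```.iloc[]``` or the ```.loc[]``` indexer. ```.iloc[]``` can be used to recover the regular, position-based indexing: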

In [48]:
composers_df.iloc[0]

birth       1860
death       1911
city     Kaliste
Name: Mahler, dtype: object

In [49]:
composers_df.iloc[0,1]

1911

While ```.loc[]``` allows one to recover elements by using the **explicit** index, in our case the composers' names:

In [50]:
composers_df.loc['Mahler','death']

1911

In [51]:
composers_df

| | birth | death | city |
|---|---|---|---|
| Mahler | 1860 | 1911 | Kaliste |
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |
| Shostakovich | 1906 | 1975 | Saint-Petersburg |


**Remember that ```loc``` and ```iloc``` use brackets [] and not parentheses ().**

Numpy-style indexing works here too:

In [52]:
composers_df.iloc[1:3]

| | birth | death | city |
|---|---|---|---|
| Beethoven | 1770 | 1827 | Bonn |
| Puccini | 1858 | 1924 | Lucques |


In [53]:
composers_df.iloc[1:3,:2]

| | birth | death |
|---|---|---|
| Beethoven | 1770 | 1827 |
| Puccini | 1858 | 1924 |
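
One difference between the two indexers is worth a short sketch: a position slice with ```.iloc[]``` excludes its end, as in Numpy, whereas a label slice with ```.loc[]``` includes both end labels. The Dataframe below is rebuilt locally so that the example is self-contained:

```python
import pandas as pd

df = pd.DataFrame(
    {'birth': [1860, 1770, 1858, 1906], 'death': [1911, 1827, 1924, 1975]},
    index=['Mahler', 'Beethoven', 'Puccini', 'Shostakovich'])

by_position = df.iloc[1:3]                # rows 1 and 2, end excluded
by_label = df.loc['Beethoven':'Puccini']  # both end labels included

# Both selections contain the same two rows
print(by_position.equals(by_label))

# A single element, once by label and once by position
print(df.loc['Mahler', 'death'] == df.iloc[0, 1])
```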


If you are working with a large table, it might be useful to sometimes have a list of all the columns. This is given by the ```.keys()``` method (the ```.columns``` attribute returns the same index):

In [54]:
composers_df.keys()

Index(['birth', 'death', 'city'], dtype='object')

### 2.2.3 Adding columns

It is very simple to add a column to a Dataframe. One can e.g. just create a column and give it a default value that we can change later:

In [55]:
composers_df['country'] = 'default'

In [56]:
composers_df

| | birth | death | city | country |
|---|---|---|---|---|
| Mahler | 1860 | 1911 | Kaliste | default |
| Beethoven | 1770 | 1827 | Bonn | default |
| Puccini | 1858 | 1924 | Lucques | default |
| Shostakovich | 1906 | 1975 | Saint-Petersburg | default |


Or one can use an existing list:

In [57]:
country = ['Austria','Germany','Italy','Russia']

In [58]:
composers_df['country2'] = country

In [59]:
composers_df

| | birth | death | city | country | country2 |
|---|---|---|---|---|---|
| Mahler | 1860 | 1911 | Kaliste | default | Austria |
| Beethoven | 1770 | 1827 | Bonn | default | Germany |
| Puccini | 1858 | 1924 | Lucques | default | Italy |
| Shostakovich | 1906 | 1975 | Saint-Petersburg | default | Russia |
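
A new column does not have to come from a constant or a list. As a sketch that goes slightly beyond the two cases above, it can also be computed from existing columns. The column name ```age``` is an illustrative choice:

```python
import pandas as pd

df = pd.DataFrame(
    {'birth': [1860, 1770], 'death': [1911, 1827]},
    index=['Mahler', 'Beethoven'])

# The subtraction is applied row by row and stored as a new column
df['age'] = df['death'] - df['birth']
print(df)
```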


### 2.2.4 Saving to CSV
After creating or cleaning your dataset, you will often want to **save your results** so they can be reused later or shared with others.

Pandas provides the DataFrame method **`to_csv()`** to export a DataFrame into a CSV file.

#### Basic Usage
```python
DataFrame.to_csv("filename.csv")
```

In [61]:
# Save to CSV with index
composers_df.to_csv("Boyce+country_with_index.csv")

# Save to CSV without index
composers_df.to_csv("Boyce+country_no_index.csv", index=False)

print("Files have been saved successfully!")

Files have been saved successfully!
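
The effect of the ```index``` option becomes visible when the file is read back. The following sketch assumes a small illustrative Dataframe and a hypothetical file name, writes into a temporary directory so that nothing is left on disk, and uses ```pd.read_csv```, which is not introduced in the text above:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {'birth': [1860, 1770], 'city': ['Kaliste', 'Bonn']},
    index=['Mahler', 'Beethoven'])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this example
    path = os.path.join(tmp_dir, 'composers_example.csv')
    df.to_csv(path)  # the index is written as the first column

    # index_col=0 turns the first column back into the index
    restored = pd.read_csv(path, index_col=0)
    # Without it, the saved index comes back as an ordinary column
    no_index_col = pd.read_csv(path)

print(restored)
print(no_index_col.columns.tolist())
```

If the index only contains default row numbers, saving with ```index=False``` avoids this extra column altogether.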
