# Arrays and DataFrames

Python has several built-in types of data structures such as lists and dictionaries, but none of them allows you to efficiently do computations on large lists or tables of data. The Numpy and Pandas libraries fill this gap by providing 1) specific data structures on which one can efficiently compute and 2) functions to perform the computations on these structures.

Numpy is the core scientific computing library on which virtually all other scientific Python packages are built (deep learning libraries being the main exception). This notably includes Scipy, a package offering numerical routines for optimization, regression, etc., which will be used in the last series of lectures.

Pandas, which itself builds on top of Numpy, is the core data science library on which higher-level libraries (e.g. for plotting) are built.

Here we briefly present the main data structure offered by Pandas, the **DataFrame**, and then show that the data in DataFrames are actually stored as **arrays**, the main data structure of Numpy. This allows us to explore Numpy in the next notebooks before coming back to Pandas at the end of the course.

Let's first import both Pandas and Numpy. Both have very commonly used abbreviations, ```pd``` and ```np```, that you should use as well to simplify your life. Note that most sub-modules are directly accessible, so you won't have to import specific ones as might be the case for other libraries.

In [1]:
import pandas as pd
import numpy as np

## Importing data
Just like any other library, Pandas offers functionality through the *dot-notation* (as e.g. in ```math.cos()```). When surveying usage of Pandas in places like GitHub, it turns out that the most used function is the one that imports a CSV (comma-separated values) file. Indeed, while we could import tabular data with Python's built-in ```open``` function and parse it ourselves, this quickly becomes cumbersome for complex data, and Pandas takes care of all the details for us.
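As a rough illustration of why a dedicated reader helps, the sketch below parses the same small table once by hand and once with ```read_csv```. The CSV content and the variable names are made up for this example, and ```io.StringIO``` merely stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = "name,length,condition\na,1.5,glucose\nb,2.5,glycerol\n"

# Manual approach: split lines and fields by hand.
# Every value stays a string and must be converted separately.
rows = [line.split(",") for line in csv_text.strip().split("\n")]
header, records = rows[0], rows[1:]

# Pandas approach: a single call that uses the first line as column
# labels and infers numeric types where possible
table = pd.read_csv(io.StringIO(csv_text))

print(records)
print(table)
```

In the manual version the lengths are still text (```'1.5'```), whereas in the DataFrame the ```length``` column can be used directly in computations.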


### read_*** functions

As an example we import here the [bacteria_growth.csv](data/bacteria_growth.csv) file which contains information about the size and growth rate of bacteria in different conditions. Pandas has many importing functions and we use here ```read_csv```. You can see the list of all available importers by typing ```pd.read_``` and seeing options from autocomplete.

The Pandas reader functions can read local files or download remote ones, i.e. we can also pass a URL pointing to a file instead of a local path.

In [14]:
bacteria = pd.read_csv('data/bacteria_growth/bact_glucose.csv')

Let's first see what this object is:

In [15]:
type(bacteria)

pandas.core.frame.DataFrame

We see that we are dealing with a DataFrame, the two-dimensional data structure offered by Pandas. If we just execute a cell with that variable, a preview of the table is displayed (for a long table, the first and last rows):

In [16]:
bacteria

Unnamed: 0,Ld,Lb,growth_rate,length,condition
0,65.867358,46.012743,21.254797,[47 48 50 51 52 53 55 54 61 63 65],glucose
1,51.483599,21.078821,15.524108,[22 21 24 24 27 25 27 29 31 30 32 35 35 40 39 ...,glucose
2,60.968305,23.334875,15.877922,[23 23 26 25 32 28 30 30 30 34 35 37 40 47 44 ...,glucose
3,54.485996,30.307942,15.362987,[31 33 34 35 35 37 39 41 43 44 49 50 53],glucose
4,58.706814,27.763908,14.810464,[29 30 30 33 34 34 37 37 39 42 44 47 48 51 54 57],glucose
...,...,...,...,...,...
222,54.703412,33.658902,18.554392,[36 37 36 37 38 39 41 43 44 47 49 52 54],glucose
223,52.103220,26.761327,14.564780,[28 29 29 30 31 34 35 39 39 41 43 43 49 50],glucose
224,65.489842,26.603234,14.619101,[27 29 31 30 31 34 35 34 39 42 43 44 48 49 52 ...,glucose
225,75.598004,45.519710,23.228589,[45 47 48 51 51 52 54 56 58 61 63 63 65 65 68 ...,glucose


We see that we are dealing with tabular data of various types: we have text (e.g. ```condition```), numbers (```Ld```) and even lists (```length```). Just like in a spreadsheet, each column has a label and each row has an index (the column in bold on the left), so that every element can be located with these *coordinates*.
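To make the idea of coordinates concrete, here is a minimal sketch on a small hypothetical DataFrame (the values are rounded and chosen for illustration, they are not the full dataset). It uses the ```loc``` accessor of Pandas, which takes a row index and a column label.

```python
import pandas as pd

# Small hypothetical table mimicking two columns of the bacteria data
small = pd.DataFrame({
    "Ld": [65.9, 51.5, 61.0],
    "condition": ["glucose", "glucose", "glucose"],
})

# Locate one element by its coordinates: row index 1, column label 'Ld'
value = small.loc[1, "Ld"]
print(value)
```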

## Methods attached to dataframes

We have seen before that every variable in Python has a series of functions attached to it (methods). For example if we have a text variable (string) we can e.g. split it at a given place:

In [17]:
my_string = 'This is a sentence with spaces.'
my_string.split(' ')

['This', 'is', 'a', 'sentence', 'with', 'spaces.']

The same logic applies to Pandas DataFrames and in general to any new variable that is created by a given package: they all come with a series of methods attached to them.

For example, if we only want to display the first 5 rows of the DataFrame, we can use the ```head``` method:

In [18]:
bacteria.head(5)

Unnamed: 0,Ld,Lb,growth_rate,length,condition
0,65.867358,46.012743,21.254797,[47 48 50 51 52 53 55 54 61 63 65],glucose
1,51.483599,21.078821,15.524108,[22 21 24 24 27 25 27 29 31 30 32 35 35 40 39 ...,glucose
2,60.968305,23.334875,15.877922,[23 23 26 25 32 28 30 30 30 34 35 37 40 47 44 ...,glucose
3,54.485996,30.307942,15.362987,[31 33 34 35 35 37 39 41 43 44 49 50 53],glucose
4,58.706814,27.763908,14.810464,[29 30 30 33 34 34 37 37 39 42 44 47 48 51 54 57],glucose


We will learn later how to extract statistics, but just to illustrate the kind of functions that exist, you can get a basic statistical description of a DataFrame using the ```describe``` method:

In [19]:
bacteria.describe()

Unnamed: 0,Ld,Lb,growth_rate
count,227.0,227.0,227.0
mean,57.483635,27.948543,18.678329
std,7.808054,5.874752,4.683287
min,42.948384,19.523587,11.785992
25%,51.694862,24.169701,16.065863
50%,56.01761,26.263799,17.661124
75%,60.595562,30.082148,19.921363
max,103.100417,56.252563,53.514923


We see that in this particular case, the returned object is a DataFrame as well!
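Because the result is itself a DataFrame, the methods and indexing seen so far apply to it as well. The following sketch shows this on a small hypothetical table (made-up values); the row label ```'mean'``` is one of the labels produced by ```describe```.

```python
import pandas as pd

# Small hypothetical table with two numeric columns
small = pd.DataFrame({"Ld": [60.0, 50.0, 55.0], "Lb": [30.0, 20.0, 25.0]})

summary = small.describe()
print(type(summary))

# The summary is a DataFrame, so we can call head on it again...
first_rows = summary.head(2)

# ...or pick a single statistic by its row label and column label
mean_ld = summary.loc["mean", "Ld"]
print(mean_ld)
```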

## Accessing columns

If we want to work with only a given column of a DataFrame, we can extract it. Just like when we extract an element from a regular list, e.g. ```my_list[3]```, with Pandas we also use square brackets, but this time with the name of the column:

In [20]:
some_column = bacteria['Ld']
some_column

0      65.867358
1      51.483599
2      60.968305
3      54.485996
4      58.706814
         ...    
222    54.703412
223    52.103220
224    65.489842
225    75.598004
226    55.935799
Name: Ld, Length: 227, dtype: float64

As we have a single column here, we are not dealing with a DataFrame anymore but with a Series:

In [21]:
type(some_column)

pandas.core.series.Series

We won't explore the Series object much on its own in this course, but know that you can, for example, create a DataFrame from scratch by combining multiple Series.
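As a minimal sketch of that last point, two hypothetical Series (with made-up values) can be combined into a DataFrame by passing them in a dictionary whose keys become the column labels. This is one possible way among several.

```python
import pandas as pd

# Two hypothetical Series sharing the same default index (0, 1, 2)
ld = pd.Series([65.9, 51.5, 61.0])
lb = pd.Series([46.0, 21.1, 23.3])

# Dictionary keys become column labels, each Series becomes a column
combined = pd.DataFrame({"Ld": ld, "Lb": lb})
print(combined)

# Extracting a column gives back a Series
print(type(combined["Ld"]))
```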

## What is underlying Pandas

We see above that ```some_column``` is composed of an index (0, 1, 2, etc.) and the corresponding values (65.86, 51.48, etc.). Underlying this object is just a list of numbers that we can access using the ```values``` attribute:

In [22]:
actual_values = some_column.values
actual_values

array([ 65.86735805,  51.48359947,  60.96830488,  54.48599598,
        58.70681418,  50.24489709,  54.57645043,  51.31471177,
        55.40598884,  53.1871301 ,  53.4183105 ,  58.27425747,
        53.48041021,  54.1192352 ,  54.84970684,  50.0999626 ,
        60.19155736,  56.83683752,  54.4873395 ,  56.01761002,
        62.59942566,  51.31843236,  56.62032762,  52.05086737,
        50.18816107,  51.81342814,  51.18012999,  59.05227627,
        49.65537319,  57.81452246,  53.58060185,  60.16063682,
        56.74832791,  58.80612207,  52.93445669,  52.33525619,
        58.73078548,  60.02897025,  58.62064383,  52.07703658,
        59.93047728,  51.63528828,  49.33709914,  58.06904114,
        67.20348055,  61.44785587,  54.40922809,  47.18800193,
        57.44733883,  65.84898568,  56.81898508,  51.92820273,
        65.55086101,  63.10502017,  59.84902928,  57.80726022,
        61.50449708,  60.89626607,  60.00633675,  56.18448575,
        60.03863271,  60.07136331,  65.89833811,  48.09

We see that the output is not just a simple Python list. It is in fact called an ```array```. If we ask for the type of this object, we get:

In [23]:
type(actual_values)

numpy.ndarray
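To make the two parts of a Series explicit, here is a minimal sketch on a hypothetical three-element Series (rounded values chosen for illustration). It also calls ```to_numpy()```, a Series method that gives the same kind of array as ```values```.

```python
import numpy as np
import pandas as pd

# Hypothetical short column
col = pd.Series([65.9, 51.5, 61.0], name="Ld")

labels = list(col.index)  # the index part: 0, 1, 2
raw = col.values          # the values part, as a Numpy array
same = col.to_numpy()     # alternative way to obtain the array

print(labels)
print(type(raw))
```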

The lists of values contained in Pandas DataFrames are in fact Numpy arrays. These are in principle just lists, or lists of lists for higher dimensions, but in contrast to native Python lists we can handle them as single objects and, for example, execute computations directly on the whole array instead of writing a loop traversing all values:

In [24]:
actual_values / 10

array([ 6.58673581,  5.14835995,  6.09683049,  5.4485996 ,  5.87068142,
        5.02448971,  5.45764504,  5.13147118,  5.54059888,  5.31871301,
        5.34183105,  5.82742575,  5.34804102,  5.41192352,  5.48497068,
        5.00999626,  6.01915574,  5.68368375,  5.44873395,  5.601761  ,
        6.25994257,  5.13184324,  5.66203276,  5.20508674,  5.01881611,
        5.18134281,  5.118013  ,  5.90522763,  4.96553732,  5.78145225,
        5.35806019,  6.01606368,  5.67483279,  5.88061221,  5.29344567,
        5.23352562,  5.87307855,  6.00289703,  5.86206438,  5.20770366,
        5.99304773,  5.16352883,  4.93370991,  5.80690411,  6.72034805,
        6.14478559,  5.44092281,  4.71880019,  5.74473388,  6.58489857,
        5.68189851,  5.19282027,  6.5550861 ,  6.31050202,  5.98490293,
        5.78072602,  6.15044971,  6.08962661,  6.00063367,  5.61844857,
        6.00386327,  6.00713633,  6.58983381,  4.80982228,  6.02948583,
        5.61998035,  5.88528161,  4.90585957,  5.12771774,  6.40
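For comparison, the sketch below performs the same division on a plain Python list and on a Numpy array, using three made-up values. With the list we need an explicit loop (here a list comprehension), while the array is divided in one expression.

```python
import numpy as np

# Hypothetical values, once as a list and once as an array
values_list = [65.0, 51.0, 60.0]
values_array = np.array(values_list)

# Plain list: every element has to be visited explicitly
divided_list = [v / 10 for v in values_list]

# Numpy array: the operation is applied to all elements at once
divided_array = values_array / 10

print(divided_list)
print(divided_array)
```

Writing ```values_list / 10``` directly would raise a ```TypeError```, since division is not defined for Python lists.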



A large part of the computing logic in Pandas is inspired by Numpy. We therefore now make a detour to Numpy before coming back to DataFrames, the data type we are mostly interested in in this course.

## Exercise

1. Try to import the ```penguins.csv``` file in the ```data``` folder

2. Use the ```head``` method to see the first 6 rows of the table.

3. Can you find a method (function attached to the object) to calculate only means in the table?

4. What do you observe in the result? Are some columns missing?