# Pandas

## What kind of data does pandas handle?

In [None]:
import pandas as pd

To load the pandas package and start working with it, import it.

The convention in the Python community is to import it as `pd`, so all of the documentation assumes that this is what you have done.

### Representing a pandas data table

![](img/01_table_dataframe.svg)

I want to store data about Titanic passengers. For a number of passengers, I know their name (text), their age (integers), and their sex (male/female).

In [None]:
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)


df

To create a data table by hand, create an instance of `DataFrame`. If you pass it a Python dictionary of lists, the dictionary keys become the column names and the dictionary values (the lists) become the column contents.

A `DataFrame` is a 2-dimensional data structure that can store data of different types (text, integers, floating-point values, categorical data, dates…) in columns. It is similar to a spreadsheet, a SQL table in a database, or the `data.frame` object in R.


In our table:
- There are 3 columns, each with its own name: `Name`, `Age` and `Sex`.
- The `Name` column contains text data, with each value a string. The `Age` column contains numbers, and the `Sex` column also contains text.

In spreadsheet software, our data would have a very similar representation:

![](img/01_table_spreadsheet.png)

### Each column is a `Series`

![](img/01_table_series.svg)

I am only interested in the data in the `Age` column.

In [None]:
df["Age"]

When you select a single column of a `DataFrame`, the result is a `pandas.Series`. To select a column, put its name between square brackets `[]`.



<div class='alert alert-info'>
If you are familiar with Python dictionaries, selecting a single column is very similar to selecting a dictionary value by its key.
</div>

You can also create a `Series` from scratch:

In [None]:
ages = pd.Series([22, 35, 58], name="Age")
ages

A `pandas.Series` has no column labels, as it is just a single column of a `DataFrame`. It does have row labels (by default 0, 1, 2, …).
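As a small sketch of those row labels: the first `Series` below relies on the default labels, while the second uses custom labels. The surnames used as labels are a hypothetical choice made only for this illustration.

```python
import pandas as pd

# Default row labels: 0, 1, 2, ...
ages = pd.Series([22, 35, 58], name="Age")
print(list(ages.index))

# Hypothetical custom row labels, chosen only for this illustration
ages_by_surname = pd.Series(
    [22, 35, 58], index=["Braund", "Allen", "Bonnell"], name="Age"
)
print(ages_by_surname["Allen"])  # look up a value by its row label
```

The `name` of a `Series` is what becomes the column label once the `Series` is part of a `DataFrame`.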

### Doing something with a `pandas.Series`
I want to know the maximum age of the passengers.

We can find it by selecting the `"Age"` column of our `DataFrame` and applying the `.max()` method.

In [None]:
df["Age"].max()

The same works on a standalone `Series`:

In [None]:
ages.max()

As the `.max()` method illustrates, you can do things with a `DataFrame` or a `Series`. pandas offers a lot of functionality in the form of methods that you call on a `DataFrame` or `Series`. Since methods are functions, remember to add the parentheses `()` after their name.

### I want to see basic statistics for my numerical data

In [None]:
df.describe()

The `describe()` method gives a quick overview of the numerical data in our `DataFrame`. As the `Name` and `Sex` columns hold textual data, `describe()` ignores them by default.

Many operations return a new `DataFrame` or `Series`. The `describe()` method is an example of an operation that returns a `DataFrame`.
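To make the return types concrete, here is a minimal sketch that rebuilds the three-passenger table and checks what `describe()` gives back for the whole table and for a single column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

table_summary = df.describe()          # DataFrame in, DataFrame out
column_summary = df["Age"].describe()  # Series in, Series out

print(type(table_summary).__name__, list(table_summary.columns))
print(type(column_summary).__name__, column_summary["max"])
```

Only the numerical `Age` column appears in the table summary.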

<div class='alert alert-info'>

This is only the beginning. Like a spreadsheet, **pandas** represents data as a table with columns and rows. Beyond the representation, the data manipulations and calculations you would do in a spreadsheet are also possible with **pandas**, and this guide walks through them.
    
</div>

<div class='alert alert-success'>

    
**Remember:**

- Import the pandas library with `import pandas as pd`
- A data table is stored in a `pandas.DataFrame` object
- Each column of a `DataFrame` is a `Series`
- You can perform operations by applying methods to a `DataFrame` or a `Series`

</div>

## How to read and write tabular data

![](img/02_io_readwrite.svg)

I want to analyze the Titanic passenger data, available as a `csv` file.

In [None]:
# create a data subfolder
!mkdir data
# download a CSV file
!curl https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv > data/titanic.csv

In [None]:
# load the CSV into a DataFrame
titanic = pd.read_csv("data/titanic.csv")

pandas provides the `read_csv(path)` function, which reads the data in a csv file and returns a `DataFrame`. pandas supports many data file formats (csv, excel, sql, json, parquet, …), each through a function named `read_*`.

Make it a habit to take a look at the `DataFrame` right after loading a data set.

Displaying a `DataFrame` in a notebook shows the first 5 and the last 5 rows.

In [None]:
titanic

I want to see the first 8 rows of the DataFrame:

In [None]:
titanic.head(8)

The `DataFrame.head(n)` method shows the first rows (the first 5 by default).

Similarly, the `.tail(n)` method shows the last `n` rows, and `.sample(n)` draws `n` rows at random.

To check how pandas interpreted the data in each column, inspect the `dtypes` attribute.

In [None]:
titanic.dtypes

The data type used for each column is listed.

Here we have:
- integers (`int64`),
- floating-point values (`float64`),
- strings (`object`).

<div class='alert alert-info'>

When asking for `.dtypes`, no parentheses are used! `dtypes` is an attribute of a `DataFrame` or `Series`. Attributes are internal data rather than functions, so no parentheses are needed. Methods, which do require parentheses, are functions: actions the object can perform.

</div>
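The difference between an attribute and a method can be checked directly. The sketch below uses a tiny hypothetical table and the built-in `callable()` to show that `shape` is plain data while `head` is something you call.

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 35, 58]})

print(df.shape)    # attribute: plain data, no parentheses
print(df.head(2))  # method: an action, called with parentheses

shape_is_callable = callable(df.shape)  # a tuple cannot be called
head_is_callable = callable(df.head)    # a bound method can
print(shape_is_callable, head_is_callable)
```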

My colleague asks for the Titanic data as a spreadsheet file. To be safe, we will export it in both Excel and LibreOffice formats.

We only need to install two small packages, `openpyxl` and `odfpy`.

In [None]:
# for Anaconda users
!conda install -c anaconda openpyxl odfpy

In [None]:
# for those who use pip directly
!pip install openpyxl odfpy

In [None]:
# create an export subfolder
!mkdir export

# export with the .to_excel() method
titanic.to_excel(
    "export/titanic.xlsx", sheet_name="passengers", index=False
)  # export to Excel

titanic.to_excel(
    "export/titanic.ods", sheet_name="passengers", index=False, engine="odf"
)  # export to a LibreOffice file

The `read_*` functions load data from files into a DataFrame; the `to_*` methods do the opposite.

In the example above, the sheet name is specified (otherwise it would simply be "Sheet1"). The option `index=False` ensures that the row labels are not exported.
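The effect of `index=False` can be seen in a short round trip. This sketch makes two assumptions to stay self-contained: it uses the CSV format rather than Excel (so no extra package is needed) and writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Braund", "Allen"], "Age": [22, 35]})

# Export without row labels, then read the text back
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

# Export with row labels: they come back as an extra, unnamed column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

When the row labels are just the default 0, 1, 2, … they carry no information, which is why dropping them on export is usually the cleaner choice.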


In [None]:
titanic = pd.read_excel("export/titanic.xlsx", sheet_name="passengers")  # reload the data from the Excel file

In [None]:
titanic.head()  # is everything still there?

### ❓ I want a technical summary of my `DataFrame`

In [None]:
titanic.info()

The `.info()` method gives a technical summary of the DataFrame. Let's look at the output in more detail:

- It is indeed a DataFrame.
- There are 891 entries, i.e. 891 rows. Each row has a label (called the index), with values from 0 to 890.
- The table has 12 columns.
- Most columns have a value in every row (891 non-null values).
- Some columns have fewer than 891 non-null values, so they contain missing values.
- The columns `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked` hold textual data (strings, shown here as `object`).
- The other columns are numerical: some are integers, others floating-point values (`float`).
- An estimate of the DataFrame's memory footprint is given.

<div class='alert alert-success'>

**Remember:**

- Getting data into pandas from different file types is done with functions starting with `read_`.
- Exporting data from pandas to a file is done with the `DataFrame` methods starting with `to_`.
- The `head`, `tail` and `info` methods and the `dtypes` attribute are useful for a first check of the data.

</div>

## Selecting a subset of a `DataFrame`

In [None]:
# recreate the DataFrame from the Titanic CSV
titanic = pd.read_csv("data/titanic.csv")
titanic.head()

### ❓ How to select specific columns

![](img/03_subset_columns.svg)

I only want the age of the passengers.

In [None]:
ages = titanic["Age"]

In [None]:
ages.head()

To select a single column, use square brackets `[]` with the name of the column.

Each column is a `Series` object. When a single column is selected, the returned object is a `Series`. We can check this with the `type()` function.

In [None]:
type(titanic["Age"])

Or look at the shape of this object:

In [None]:
titanic["Age"].shape

`.shape` is an attribute (remember, not a method, so no parentheses) of a DataFrame or a Series that holds the number of rows and columns: (nrows, ncolumns).

A Series is 1-dimensional, so the tuple only contains the number of rows.

---
I am interested in the age and sex of the Titanic passengers.

In [None]:
age_sex = titanic[["Age", "Sex"]]

In [None]:
age_sex.head()

To select multiple columns, use a list of column names within the selection brackets `[]`.

<div class='alert alert-info'>

The inner square brackets define a Python list with column names, whereas the outer brackets are used to select the data from a pandas DataFrame as seen in the previous example.

</div>

The returned data type is a pandas DataFrame:

In [None]:
type(titanic[["Age", "Sex"]])

In [None]:
titanic[["Age", "Sex"]].shape

The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.

### How to select specific rows of a `DataFrame`

I’m interested in the passengers older than 35 years.

In [None]:
above_35 = titanic[titanic["Age"] > 35]

In [None]:
above_35.head()

To select rows based on a conditional expression, use a condition inside the selection brackets `[]`.

The condition inside the selection brackets, `titanic["Age"] > 35`, checks for which rows the `Age` column has a value larger than 35:

In [None]:
titanic["Age"] > 35

The output of the conditional expression (>, but also ==, !=, <, <=,… would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.
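The two steps (build the boolean `Series`, then filter with it) can be separated to see what happens. The four-row table below is hypothetical and only serves this illustration.

```python
import pandas as pd

# Hypothetical mini table, used only for this illustration
people = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid", "Dee"], "Age": [22, 40, 36, 35]}
)

mask = people["Age"] > 35  # boolean Series, one value per row
older = people[mask]       # keeps only the rows where mask is True

print(mask.tolist())
print(older["Name"].tolist())
```

Note that the comparison is strict: the passenger aged exactly 35 is not selected.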

We know from before that the original Titanic DataFrame consists of 891 rows. Let’s have a look at the number of rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:

In [None]:
above_35.shape

I’m interested in the Titanic passengers from cabin class 2 and 3.

In [None]:
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()

Similar to the conditional expression, the `isin()` conditional function returns `True` for each row whose value is in the provided list. To filter rows based on such a function, use the conditional function inside the selection brackets `[]`. In this case, the condition `titanic["Pclass"].isin([2, 3])` checks for which rows the `Pclass` column is either 2 or 3.


The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator:


In [None]:
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

class_23.head()

<div class='alert alert-info'>



When combining multiple conditional statements, each condition must be surrounded by parentheses `()`. Moreover, you cannot use the Python keywords `or`/`and`; use the element-wise operators `|` (or) and `&` (and) instead.


</div>
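A short sketch of combined conditions, on a hypothetical four-passenger table. It also shows the `~` operator (element-wise "not"), which is not mentioned above but follows the same rules, and what happens when the keyword `and` is used by mistake.

```python
import pandas as pd

passengers = pd.DataFrame(
    {"Pclass": [1, 2, 3, 3], "Age": [50.0, 30.0, 40.0, 20.0]}
)

# & (and): both conditions must hold; note the parentheses around each one
older_in_class_23 = passengers[
    (passengers["Pclass"].isin([2, 3])) & (passengers["Age"] > 35)
]

# ~ (not): invert a boolean Series
not_first_class = passengers[~(passengers["Pclass"] == 1)]

# The Python keyword `and` does not work element-wise and raises an error
keyword_and_failed = False
try:
    passengers[(passengers["Pclass"] == 2) and (passengers["Age"] > 35)]
except ValueError:
    keyword_and_failed = True

print(len(older_in_class_23), len(not_first_class), keyword_and_failed)
```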

I want to work with passenger data for which the age is known.

In [None]:
age_no_na = titanic[titanic["Age"].notna()]
age_no_na.head()

The `notna()` conditional function returns `True` for each row whose value is not a null value. As such, it can be combined with the selection brackets `[]` to filter the data table.

You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:

In [None]:
age_no_na.shape

### How to select specific rows and columns

![](img/03_subset_columns_rows.svg)

I’m interested in the names of the passengers older than 35 years.

In [None]:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

In [None]:
adult_names.head()

In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

When using the column names, row labels or a condition expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
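The different kinds of `loc` selectors can be tried on a small table. The row labels `"a"`, `"b"`, `"c"` are hypothetical and chosen so that labels are clearly distinct from positions.

```python
import pandas as pd

table = pd.DataFrame(
    {
        "Name": ["Braund", "Allen", "Bonnell"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    },
    index=["a", "b", "c"],  # hypothetical row labels
)

all_rows_two_columns = table.loc[:, ["Name", "Age"]]  # colon: every row
label_slice = table.loc["a":"b", "Age"]               # label slices include both ends
single_value = table.loc["c", "Name"]                 # one row label, one column label

print(all_rows_two_columns.shape)
print(label_slice.tolist())
print(single_value)
```

Unlike ordinary Python slices, a slice of labels in `loc` includes its end label.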

I’m interested in rows 10 till 25 and columns 3 to 5.

In [None]:
titanic.iloc[9:25, 2:5]

Again, a subset of both rows and columns is made in one go and just using selection brackets `[]` is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the `iloc` operator in front of the selection brackets `[]`.
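Positions in `iloc` start at 0 and a slice excludes its end, which is why "rows 10 till 25" is written `9:25`. The sketch below checks this on a hypothetical table whose single column stores the 1-based row number.

```python
import pandas as pd

# Hypothetical table whose values equal the 1-based row number
numbered = pd.DataFrame({"row_number": range(1, 31)})

rows_10_to_25 = numbered.iloc[9:25]  # positions 9..24, i.e. rows 10 to 25

first = rows_10_to_25["row_number"].iloc[0]
last = rows_10_to_25["row_number"].iloc[-1]
print(first, last, len(rows_10_to_25))
```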

When selecting specific rows and/or columns with `loc` or `iloc`, new values can be assigned to the selected data. For example, to assign the name `anonymous` to the first 3 elements of the fourth column (position 3):

In [None]:
titanic.iloc[0:3, 3] = "anonymous"

In [None]:
titanic.head()

<div class='alert alert-info'>


**Remember:**

- When selecting subsets of data, square brackets `[]` are used.
- Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
- Select specific rows and/or columns using `loc` when using the row and column names.
- Select specific rows and/or columns using `iloc` when using the positions in the table.
- You can assign new values to a selection based on `loc`/`iloc`.



</div>

## How to create plots in pandas

In [None]:
import matplotlib.pyplot as plt

In [None]:
# download a CSV file
!curl https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2.csv > data/air_quality_no2.csv

In [None]:
air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)

air_quality.head()

<div class='alert alert-info'>



The `index_col` and `parse_dates` parameters of the `read_csv` function are used here to define the first (0th) column as the index of the resulting DataFrame and to convert the dates in that column to `Timestamp` objects, respectively.
    
    
</div>

![](img/04_plot_overview.svg)

I want a quick visual check of the data.

In [None]:
air_quality.plot();

With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data.

I want to plot only the columns of the data table with the data from Paris.

In [None]:
air_quality["station_paris"].plot();

To plot a specific column, use the selection method of the subset data tutorial in combination with the plot() method. Hence, the plot() method works on both Series and DataFrame.

In [None]:
air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5);

Apart from the default line plot when using the plot function, a number of alternatives are available to plot data. Let’s use some standard Python to get an overview of the available plot methods:

In [None]:
[
    method_name
    for method_name in dir(air_quality.plot)
    if not method_name.startswith("_")
]

<div class='alert alert-info'>

In many development environments as well as IPython and Jupyter Notebook, use the TAB button to get an overview of the available methods, for example air_quality.plot. + TAB.

</div>

In [None]:
air_quality.plot.box();

I want each of the columns in a separate subplot.

In [None]:
axs = air_quality.plot.area(figsize=(12, 4), subplots=True)

Separate subplots for each of the data columns are supported by the subplots argument of the plot functions. The builtin options available in each of the pandas plot functions are worth reviewing.

I want to further customize, extend or save the resulting plot.

In [None]:
fig, axs = plt.subplots(figsize=(12, 4))
air_quality.plot.area(ax=axs)
axs.set_ylabel("NO$_2$ concentration")
fig.savefig("export/no2_concentrations.png")

Each of the plot objects created by pandas is a Matplotlib object. As Matplotlib provides plenty of options to customize plots, making the link between pandas and Matplotlib explicit makes the full power of Matplotlib available for the plot. This strategy is applied in the previous example:

In [None]:
fig, axs = plt.subplots(figsize=(12, 4))  # Create an empty matplotlib Figure and Axes
air_quality.plot.area(
    ax=axs
)  # Use pandas to put the area plot on the prepared Figure/Axes
axs.set_ylabel("NO$_2$ concentration")  # Do any matplotlib customization you like
fig.savefig(
    "export/no2_concentrations.png"
)  # Save the Figure/Axes using the existing matplotlib method.

<div class="alert alert-info">


- The `.plot.*` methods are applicable on both Series and DataFrames
- By default, each of the columns is plotted as a different element (line, boxplot,…)
- Any plot created by pandas is a Matplotlib object.


</div>

## How to create new columns derived from existing columns

![](img/05_newcolumn_1.svg)

In [None]:
air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)

air_quality.head()

I want to express the $NO_2$ concentration of the station in London in mg/m$^3$.

*(If we assume temperature of 25 degrees Celsius and pressure of 1013 hPa, the conversion factor is 1.882)*

In [None]:
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
air_quality.head()

<div class='alert alert-info'>

The calculation of the values is done element-wise. This means all values in the given column are multiplied by the value 1.882 at once. You do not need to use a loop to iterate over the rows!

</div>

![](img/05_newcolumn_2.svg)

I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column

In [None]:
air_quality["ratio_paris_antwerp"] = (
    air_quality["station_paris"] / air_quality["station_antwerp"]
)


air_quality.head()

The calculation is again element-wise, so the `/` is applied to the values in each row.

Other mathematical operators (+, -, \*, /) and logical operators (<, >, ==, …) also work element-wise. The latter were already used in the subset data section to filter rows of a table using a conditional expression.

If you need more advanced logic, you can use arbitrary Python code via apply().
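As a minimal sketch of `apply()`: the function below labels each value as `"low"` or `"high"`. The threshold of 40 and the function name `classify` are hypothetical, picked only to have some logic that a plain arithmetic operator cannot express.

```python
import pandas as pd

air = pd.DataFrame({"station_paris": [10.0, 45.0, 80.0]})


def classify(concentration):
    """Label a value with a hypothetical threshold of 40 (illustration only)."""
    return "high" if concentration >= 40 else "low"


# apply() calls the function once per value of the Series
air["level"] = air["station_paris"].apply(classify)
print(air)
```

Because `apply()` runs Python code value by value, it is generally slower than the built-in element-wise operators, so it is best kept for logic those operators cannot express.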

I want to rename the data columns to the corresponding station identifiers used by openAQ.

In [None]:
air_quality_renamed = air_quality.rename(
    columns={
        "station_antwerp": "BETR801",
        "station_paris": "FR04014",
        "station_london": "London Westminster",
    }
)

In [None]:
air_quality_renamed.head()

The `rename()` method can be used for both row labels and column labels. Provide a dictionary whose keys are the current names and whose values are the new names.

The mapping is not restricted to fixed names; it can also be a mapping function. For example, converting the column names to lowercase letters can be done with a function:

In [None]:
air_quality_renamed = air_quality_renamed.rename(columns=str.lower)

air_quality_renamed.head()

<div class='alert alert-info'>


- Create a new column by assigning the output to the DataFrame with a new column name in between the [].
- Operations are element-wise, no need to loop over rows.
- Use rename with a dictionary or function to rename row labels or column names.



</div>

## How to calculate summary statistics
Data for this section: Titanic

In [None]:
# recreate the titanic DataFrame from the csv
titanic = pd.read_csv("data/titanic.csv")

titanic.head()

### Aggregating statistics

![](img/06_aggregate.svg)

What is the average age of the Titanic passengers?

In [None]:
titanic["Age"].mean()

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.

![](img/06_reduction.svg)

What is the median age and ticket fare price of the Titanic passengers?

In [None]:
titanic[["Age", "Fare"]].median()

The statistic applied to multiple columns of a DataFrame (the selection of two columns returns a DataFrame, see the subset data section) is calculated for each numeric column.

The aggregating statistic can be calculated for multiple columns at the same time. Remember the `describe()` method from the first section?

In [None]:
titanic[["Age", "Fare"]].describe()

Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the DataFrame.agg() method:

In [None]:
titanic.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)

### Aggregating statistics grouped by category

![](img/06_groupby.svg)

What is the average age for male versus female Titanic passengers?

In [None]:
titanic[["Sex", "Age"]].groupby("Sex").mean()

As our interest is the average age for each gender, a subselection on these two columns is made first: titanic[["Sex", "Age"]]. Next, the groupby() method is applied on the Sex column to make a group per category. The average age for each gender is calculated and returned.


Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the `Sex` column) is a common pattern. The `groupby` method supports this type of operation. More generally, it fits the split-apply-combine pattern:

- Split the data into groups
- Apply a function to each group independently
- Combine the results into a data structure

The apply and combine steps are typically done together in pandas.
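To see the three steps separately, the sketch below spells them out by hand on a hypothetical four-passenger table, then compares with the one-line `groupby` version.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Sex": ["male", "female", "male", "female"],
        "Age": [20.0, 30.0, 40.0, 50.0],
    }
)

# Split: iterate over one sub-table per category.
# Apply: compute the mean age of each sub-table.
manual = {}
for sex, group in people.groupby("Sex"):
    manual[sex] = group["Age"].mean()

# Combine: gather the per-group results into one Series
combined = pd.Series(manual, name="Age")

# The usual one-liner does all three steps at once
shortcut = people.groupby("Sex")["Age"].mean()

print(combined)
print(shortcut)
```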

In the previous example, we explicitly selected the 2 columns first. If we do not, the `mean` method can be applied to every column containing numerical data by passing `numeric_only=True`:

In [None]:
titanic.groupby("Sex").mean(numeric_only=True)

It does not make much sense to get the average value of `Pclass`. If we are only interested in the average age for each gender, the selection of columns (square brackets `[]` as usual) is supported on the grouped data as well:

In [None]:
titanic.groupby("Sex")["Age"].mean()

![](img/06_groupby_select_detail.svg)

<div class='alert alert-info'>

The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a Categorical data type to handle this type of data. More information is provided in the user guide Categorical data section.

</div>

What is the mean ticket fare price for each of the sex and cabin class combinations?


In [None]:
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby() method.

### Count the number of records per category

![](img/06_valuecounts.svg)

The `value_counts()` method counts the number of records for each category in a column.
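A minimal sketch of `value_counts()`, using a hypothetical list of six cabin classes so that the counts are easy to check by eye:

```python
import pandas as pd

# Hypothetical cabin classes for six passengers
passengers = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

counts = passengers["Pclass"].value_counts()  # most frequent category first
print(counts)
```

The result is a `Series` whose row labels are the categories and whose values are the number of records in each.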

The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:

In [None]:
titanic.groupby("Pclass")["Pclass"].count()

<div class='alert alert-info'>


Both size and count can be used in combination with groupby. Whereas size includes NaN values and just provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use the dropna argument to include or exclude the NaN values.

</div>
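The difference between `size` and `count` only shows up when values are missing. The sketch below uses a hypothetical table in which three of five ages are unknown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 2],
        "Age": [30.0, np.nan, 20.0, np.nan, np.nan],
    }
)

sizes = df.groupby("Pclass")["Age"].size()    # rows per group, NaN included
counts = df.groupby("Pclass")["Age"].count()  # non-missing values per group

# dropna=False makes value_counts report the missing values too
with_missing = df["Age"].value_counts(dropna=False)

print(sizes.tolist(), counts.tolist())
print(with_missing)
```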

<div class='alert alert-warning'>



- Aggregation statistics can be calculated on entire columns or rows
- groupby provides the power of the split-apply-combine pattern
- value_counts is a convenient shortcut to count the number of entries in each category of a variable

    
</div>

## How to reshape the layout of tables

### Data for this section

In [None]:
titanic = pd.read_csv("data/titanic.csv")

In [None]:
# download a CSV file
!curl https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_long.csv > data/air_quality_long.csv

In [None]:
air_quality = pd.read_csv(
    "data/air_quality_long.csv", index_col="date.utc", parse_dates=True
)

In [None]:
air_quality.head()

### Sort table rows

I want to sort the Titanic data according to the age of the passengers.

In [None]:
titanic.sort_values(by="Age").head()

I want to sort the Titanic data according to the cabin class and age in descending order.

In [None]:
titanic.sort_values(by=["Pclass", "Age"], ascending=False).head()

With `DataFrame.sort_values()`, the rows in the table are sorted according to the defined column(s). The index follows the row order.
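The sketch below shows on a hypothetical three-row table how the row labels travel with the rows, and two options that are easy to overlook: a separate sort direction per column, and `ignore_index=True` to renumber the rows.

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3, 2], "Age": [40.0, 10.0, 25.0]})

by_age = df.sort_values(by="Age")
print(list(by_age.index))  # the original row labels travel with the rows

# One direction per column: Pclass ascending, Age descending
mixed = df.sort_values(by=["Pclass", "Age"], ascending=[True, False])

# ignore_index=True renumbers the rows 0, 1, 2, ... after sorting
renumbered = df.sort_values(by="Age", ignore_index=True)
print(list(renumbered.index))
```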


### Long to wide table format


Let's use a small subset of the air quality data set. We focus on $NO_2$ data and only use the first two measurements of each location (i.e. the head of each group). The subset of data will be called `no2_subset`.

In [None]:
# filter for no2 data only

no2 = air_quality[air_quality["parameter"] == "no2"]

In [None]:
# use 2 measurements (head) for each location (groupby)

no2_subset = no2.sort_index().groupby(["location"]).head(2)

In [None]:
no2_subset

![](img/07_pivot.svg)

I want the values for the three stations as separate columns next to each other


In [None]:
no2_subset.pivot(columns="location", values="value")

The `pivot()` function is purely reshaping of the data: a single value for each index/column combination is required.
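The "single value per index/column combination" requirement can be demonstrated on a hypothetical long table with two dates and two locations: the pivot works, and fails as soon as one combination appears twice.

```python
import pandas as pd

# Hypothetical long-format table: one row per (date, location) measurement
long_table = pd.DataFrame(
    {
        "date": ["d1", "d1", "d2", "d2"],
        "location": ["A", "B", "A", "B"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
).set_index("date")

wide = long_table.pivot(columns="location", values="value")
print(wide)

# Repeat the first row: the combination (d1, A) now occurs twice
with_duplicate = pd.concat([long_table, long_table.iloc[[0]]])

duplicate_rejected = False
try:
    with_duplicate.pivot(columns="location", values="value")
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```

When duplicates are expected, an aggregation has to decide how to combine them, which is the job of `pivot_table()`.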

As pandas supports plotting of multiple columns out of the box (see the plotting section), the conversion from long to wide table format makes it possible to plot the different time series at the same time:


In [None]:
no2.head()

In [None]:
no2.pivot(columns="location", values="value").plot();

<div class='alert alert-info'>

When the index parameter is not defined, the existing index (row labels) is used.

</div>

### Pivot table

![](img/07_pivot_table.svg)


I want the mean concentrations for $NO_2$ and $PM_{2.5}$ in each of the stations in table form

In [None]:
air_quality.pivot_table(
    values="value", index="location", columns="parameter", aggfunc="mean"
)

In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated (in this specific case, the values on different time steps) pivot_table() can be used, providing an aggregation function (e.g. mean) on how to combine these values.

Pivot tables are a well-known concept in spreadsheet software. When you also want summary rows and columns (margins) for each variable, set the `margins` parameter to `True`:

In [None]:
air_quality.pivot_table(
    values="value",
    index="location",
    columns="parameter",
    aggfunc="mean",
    margins=True,
)

In case you are wondering, `pivot_table()` is indeed directly linked to `groupby()`. The same result can be derived by grouping on both `parameter` and `location`:

`air_quality.groupby(["parameter", "location"])[["value"]].mean()`
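The link between the two can be checked on a small hypothetical table. The grouped result is in long form (one row per location/parameter pair); calling `unstack()` on it moves one index level to the columns, which gives the same layout as the pivot table.

```python
import pandas as pd

# Hypothetical measurements, for illustration only
long_table = pd.DataFrame(
    {
        "location": ["A", "A", "B", "B", "A"],
        "parameter": ["no2", "pm25", "no2", "pm25", "no2"],
        "value": [10.0, 5.0, 20.0, 8.0, 30.0],
    }
)

pivoted = long_table.pivot_table(
    values="value", index="location", columns="parameter", aggfunc="mean"
)

grouped = (
    long_table.groupby(["location", "parameter"])["value"]
    .mean()
    .unstack("parameter")  # move the parameter level to the columns
)

print(pivoted)
print(grouped)
```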



### Wide to long format

Starting again from the wide format table created in the previous section:

In [None]:
no2_pivoted = no2.pivot(columns="location", values="value").reset_index()

no2_pivoted.head()

![](img/07_melt.svg)

I want to collect all air quality $NO_2$ measurements in a single column (long format)

In [None]:
no_2 = no2_pivoted.melt(id_vars="date.utc")
no_2.head()

The `pandas.melt()` method on a `DataFrame` converts the data table from wide format to long format. The column headers become the variable names in a newly created column.

The solution is the short version on how to apply pandas.melt(). The method will melt all columns NOT mentioned in id_vars together into two columns: A column with the column header names and a column with the values itself. The latter column gets by default the name value.

The pandas.melt() method can be defined in more detail:

In [None]:
no_2 = no2_pivoted.melt(
    id_vars="date.utc",
    value_vars=["BETR801", "FR04014", "London Westminster"],
    value_name="NO_2",
    var_name="id_location",
)

no_2.head()

The data are the same, but the call is now defined in more detail:

- value_vars defines explicitly which columns to melt together
- value_name provides a custom column name for the values column instead of the default column name value
- var_name provides a custom column name for the column collecting the column header names. Otherwise it takes the index name or a default variable

Hence, the arguments value_name and var_name are just user-defined names for the two generated columns. The columns to melt are defined by id_vars and value_vars.
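A compact sketch of `melt()` on a hypothetical two-date, two-station wide table makes the role of each argument visible; the column names `date`, `A` and `B` are made up for this illustration.

```python
import pandas as pd

# Hypothetical wide table: one column per station
wide = pd.DataFrame(
    {"date": ["d1", "d2"], "A": [1.0, 3.0], "B": [2.0, 4.0]}
)

long_table = wide.melt(
    id_vars="date",        # kept as-is and repeated for every melted column
    value_vars=["A", "B"],  # columns whose values are stacked
    var_name="location",   # name of the column holding the old headers
    value_name="value",    # name of the column holding the values
)
print(long_table)
```

Each of the two station columns contributes one block of rows, so the 2 × 2 wide table becomes 4 rows.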

<div class='alert alert-info'>

- Sorting by one or more columns is supported by sort_values
- The pivot function is purely restructuring of the data, pivot_table supports aggregations
- The reverse of pivot (long to wide format) is melt (wide to long format)


</div>

## How to combine data from multiple tables?

Data for this section:

In [None]:
# download a CSV file
!curl https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2_long.csv > data/air_quality_no2_long.csv

### $NO_2$ data

In [None]:
air_quality_no2 = pd.read_csv("data/air_quality_no2_long.csv", parse_dates=True)

air_quality_no2 = air_quality_no2[["date.utc", "location", "parameter", "value"]]

air_quality_no2.head()

### Particulate matter data



In [None]:
# download a CSV file
!curl https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_pm25_long.csv > data/air_quality_pm25_long.csv

In [None]:
air_quality_pm25 = pd.read_csv("data/air_quality_pm25_long.csv", parse_dates=True)


air_quality_pm25 = air_quality_pm25[["date.utc", "location", "parameter", "value"]]


air_quality_pm25.head()

### Concatenating `DataFrame` objects

![](img/08_concat_row.svg)

I want to combine the measurements of $NO_2$ and $PM_{2.5}$, two tables with a similar structure, in a single table.

In [None]:
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

air_quality.head()


The `concat()` function concatenates multiple tables along one of the axes (row-wise or column-wise).

By default concatenation is along axis 0, so the resulting table combines the rows of the input tables. Let’s check the shape of the original and the concatenated tables to verify the operation:

In [None]:
print("Shape of the ``air_quality_pm25`` table: ", air_quality_pm25.shape)

print("Shape of the ``air_quality_no2`` table: ", air_quality_no2.shape)

print("Shape of the resulting ``air_quality`` table: ", air_quality.shape)

Hence, the resulting table has 3178 = 1110 + 2068 rows.

<div class='alert alert-info'>


The axis argument recurs in a number of pandas methods that can be applied along an axis. A DataFrame has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). Most operations like concatenation or summary statistics are by default across rows (axis 0), but can be applied across columns as well.

</div>
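
To make the two axes concrete, here is a small sketch on a made-up table (the frame `df` and its values are assumptions for illustration). It applies a summary statistic and a concatenation along each axis.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the operation runs down the rows, giving one value per column.
col_means = df.mean(axis=0)
# axis=1: the operation runs across the columns, giving one value per row.
row_means = df.mean(axis=1)

# The same convention applies to concat.
stacked = pd.concat([df, df], axis=0)       # rows are appended
side_by_side = pd.concat([df, df], axis=1)  # columns are appended

print(col_means)
print(row_means)
print(stacked.shape, side_by_side.shape)
```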

Sorting the table on the datetime information also illustrates the combination of both tables, with the parameter column defining the origin of each row (either no2 from table air_quality_no2 or pm25 from table air_quality_pm25):



In [None]:
air_quality = air_quality.sort_values("date.utc")
air_quality.head()

In this specific example, the parameter column provided by the data ensures that each of the original tables can be identified. This is not always the case. The concat function provides a convenient solution with the keys argument, which adds an additional (hierarchical) row index. For example:

In [None]:
air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])
air_quality_.head()

<div class='alert alert-info'>

The existence of multiple row/column indices at the same time has not been mentioned within these tutorials. Hierarchical indexing or MultiIndex is an advanced and powerful pandas feature to analyze higher dimensional data.

Multi-indexing is out of scope for this pandas introduction. For the moment, remember that the function reset_index can be used to convert any level of an index to a column, e.g. air_quality.reset_index(level=0)

</div>
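
The following sketch shows the keys argument and reset_index working together on two tiny, made-up tables. The frames `pm25` and `no2`, their values, and the level names `source` and `row` passed through the `names` argument of `concat` are assumptions chosen for readability.

```python
import pandas as pd

pm25 = pd.DataFrame({"location": ["BETR801", "London Westminster"], "value": [18.0, 7.0]})
no2 = pd.DataFrame({"location": ["FR04014", "BETR801"], "value": [20.0, 26.5]})

# keys adds an outer index level that records which input table a row came from.
combined = pd.concat([pm25, no2], keys=["PM25", "NO2"], names=["source", "row"])
print(combined.index.nlevels)  # two levels: (source, row)

# reset_index(level=0) turns the outer level back into an ordinary column.
flat = combined.reset_index(level=0)
print(flat)
```

After the reset, the origin of each row is available as a regular column, which is often easier to filter or group on than a hierarchical index.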

### Join tables using a common identifier

![](img/08_merge_left.svg)

Add the station coordinates, provided by the stations metadata table, to the corresponding rows in the measurements table.

<div class='alert alert-warning'>

The air quality measurement station coordinates are stored in a data file air_quality_stations.csv, downloaded using the py-openaq package.
    
</div>

In [None]:
# download a CSV file
!curl https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_stations.csv  > data/air_quality_stations.csv

In [None]:
stations_coord = pd.read_csv("data/air_quality_stations.csv")

stations_coord.head()

<div class='alert alert-info'>

The stations used in this example (FR04014, BETR801 and London Westminster) are just three entries listed in the metadata table. We only want to add the coordinates of these three to the measurements table, each on the corresponding rows of the air_quality table.

</div>

In [None]:
air_quality.head()

In [None]:
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

air_quality.head()

Using the merge() function, for each of the rows in the air_quality table, the corresponding coordinates are added from the stations_coord table. Both tables have the column location in common, which is used as a key to combine the information. By choosing the left join, only the locations available in the air_quality (left) table, i.e. FR04014, BETR801 and London Westminster, end up in the resulting table. The merge function supports multiple join options similar to database-style operations.
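
To see what the choice of join type changes, the sketch below compares a left join with an inner join on two small hand-made tables. The frames `measurements` and `stations` and all coordinate values are invented for illustration, and one measured location is deliberately missing from the station table.

```python
import pandas as pd

measurements = pd.DataFrame(
    {
        "location": ["BETR801", "FR04014", "London Westminster"],
        "value": [18.0, 20.0, 26.5],
    }
)
# Hypothetical station metadata: BETR802 is never measured,
# and London Westminster has no metadata entry.
stations = pd.DataFrame(
    {
        "location": ["BETR801", "BETR802", "FR04014"],
        "coordinates.latitude": [51.2, 51.3, 48.8],
        "coordinates.longitude": [4.4, 4.5, 2.3],
    }
)

# Left join: every row of the left table is kept; missing matches become NaN.
left = pd.merge(measurements, stations, how="left", on="location")
# Inner join: only rows whose key exists in both tables are kept.
inner = pd.merge(measurements, stations, how="inner", on="location")

print(left)
print(inner)
```

In both results the unused station BETR802 is absent, but only the left join preserves the measurement that has no coordinates.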

Add the parameter full description and name, provided by the parameters metadata table, to the measurements table

<div class='alert alert-warning'>

The air quality parameters metadata are stored in a data file air_quality_parameters.csv, downloaded using the py-openaq package.

</div>

In [None]:
# download a CSV file
!curl https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_parameters.csv  > data/air_quality_parameters.csv

In [None]:
air_quality_parameters = pd.read_csv("data/air_quality_parameters.csv")

air_quality_parameters.head()

In [None]:
air_quality = pd.merge(
    air_quality, air_quality_parameters, how="left", left_on="parameter", right_on="id"
)

air_quality.head()

Compared to the previous example, there is no common column name. However, the parameter column in the air_quality table and the id column in the air_quality_parameters table both provide the measured variable in a common format. The left_on and right_on arguments are used here (instead of just on) to make the link between the two tables.
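
A minimal sketch of joining on differently named key columns, using two made-up tables (`measurements`, `parameters`, and their contents are assumptions):

```python
import pandas as pd

measurements = pd.DataFrame(
    {"parameter": ["no2", "pm25", "no2"], "value": [20.0, 18.0, 26.5]}
)
parameters = pd.DataFrame(
    {"id": ["no2", "pm25", "o3"], "name": ["NO2", "PM2.5", "O3"]}
)

merged = pd.merge(
    measurements,
    parameters,
    how="left",
    left_on="parameter",  # key column in the left table
    right_on="id",        # key column in the right table
)

print(merged)
```

Because the key columns have different names, both are kept in the result; the redundant one can be dropped afterwards with `merged.drop(columns="id")` if it is not needed.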

<div class='alert alert-info'>


REMEMBER

- Multiple tables can be concatenated both column-wise and row-wise using the concat function.

- For database-like merging/joining of tables, use the merge function.

</div>

## How to handle time series data with ease?

In [None]:
import matplotlib.pyplot as plt

Data used for this tutorial: 

In [None]:
air_quality = pd.read_csv("data/air_quality_no2_long.csv")

air_quality = air_quality.rename(columns={"date.utc": "datetime"})

air_quality.head()

In [None]:
air_quality.city.unique()

### Using pandas datetime properties

I want to work with the dates in the column datetime as datetime objects instead of plain text

In [None]:
air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

air_quality["datetime"]

Initially, the values in datetime are character strings and do not provide any datetime operations (e.g. extract the year, day of the week,…). By applying the to_datetime function, pandas interprets the strings and converts them to datetime (i.e. datetime64[ns, UTC]) objects. In pandas, these datetime objects, which are similar to datetime.datetime from the standard library, are called pandas.Timestamp.


<div class='alert alert-info'>

As many data sets do contain datetime information in one of the columns, pandas input function like pandas.read_csv() and pandas.read_json() can do the transformation to dates when reading the data using the parse_dates parameter with a list of the columns to read as Timestamp:

`pd.read_csv("data/air_quality_no2_long.csv", parse_dates=["date.utc"])`

</div>
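
The effect of parse_dates can be checked without any file on disk. The sketch below reads a two-row CSV from an in-memory string; the content of `csv_text` is an assumption modelled on the column layout used in this section.

```python
import io

import pandas as pd

csv_text = """date.utc,location,value
2019-06-21 00:00:00+00:00,FR04014,20.0
2019-06-20 23:00:00+00:00,FR04014,21.8
"""

# Without parse_dates the column stays plain text.
as_text = pd.read_csv(io.StringIO(csv_text))
# With parse_dates the listed columns are converted while reading.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date.utc"])

print(as_text["date.utc"].dtype)
print(as_dates["date.utc"].dtype)
print(as_dates["date.utc"].dt.hour.tolist())
```

Only the parsed column exposes the dt accessor; calling `.dt` on the text column would raise an error.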

Why are these pandas.Timestamp objects useful? Let’s illustrate the added value with some example cases.

> What is the start and end date of the time series data set we are working with?


In [None]:
air_quality["datetime"].min(), air_quality["datetime"].max()

Using pandas.Timestamp for datetimes enables us to calculate with date information and make them comparable. Hence, we can use this to get the length of our time series:

In [None]:
air_quality["datetime"].max() - air_quality["datetime"].min()

The result is a pandas.Timedelta object, similar to datetime.timedelta from the standard Python library and defining a time duration.
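
As a small sketch of what can be done with such a duration, the example below subtracts two timestamps and converts the result to other units. The two endpoints are assumed values used only for illustration.

```python
import pandas as pd

start = pd.Timestamp("2019-05-07 01:00:00", tz="UTC")
end = pd.Timestamp("2019-06-21 00:00:00", tz="UTC")

span = end - start  # subtracting Timestamps yields a Timedelta
print(span)         # 44 days 23:00:00

# A Timedelta can be inspected or expressed in another unit.
print(span.days)                      # whole days only
print(span / pd.Timedelta(hours=1))   # total duration in hours
```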

I want to add a new column to the DataFrame containing only the month of the measurement

In [None]:
air_quality["month"] = air_quality["datetime"].dt.month

air_quality.head()

By using Timestamp objects for dates, a lot of time-related properties are provided by pandas. For example the month, but also year, quarter, dayofyear,… All of these properties are accessible through the dt accessor.
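
A short sketch of several dt properties on a two-element Series (the dates are arbitrary examples). For the week number, recent pandas versions expose the ISO week through `dt.isocalendar()`, which is what is used here.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2019-05-07 01:00", "2019-06-21 00:00"]))

print(stamps.dt.year.tolist())
print(stamps.dt.month.tolist())
print(stamps.dt.quarter.tolist())
print(stamps.dt.day_name().tolist())

# isocalendar() returns a DataFrame with the columns year, week and day.
iso_week = stamps.dt.isocalendar().week
print(iso_week.tolist())
```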

What is the average concentration for each day of the week for each of the measurement locations?

In [None]:
air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()

Remember the split-apply-combine pattern provided by groupby from the tutorial on statistics calculation? Here, we want to calculate a given statistic (e.g. mean ) for each weekday and for each measurement location. To group on weekdays, we use the datetime property weekday (with Monday=0 and Sunday=6) of pandas Timestamp, which is also accessible by the dt accessor. The grouping on both locations and weekdays can be done to split the calculation of the mean on each of these combinations.
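
The grouping on weekday and location can be traced by hand on a very small table. The frame `df` below and its values are made up so that the expected means are easy to verify: two Monday values for FR04014 (10 and 30) average to 20.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2019-05-06 10:00",  # Monday
                "2019-05-13 10:00",  # Monday
                "2019-05-07 10:00",  # Tuesday
                "2019-05-06 10:00",  # Monday
            ]
        ),
        "location": ["FR04014", "FR04014", "FR04014", "BETR801"],
        "value": [10.0, 30.0, 50.0, 40.0],
    }
)

# A derived Series (the weekday) and a column name can be mixed as grouping keys.
weekday = df["datetime"].dt.weekday.rename("weekday")
means = df.groupby([weekday, "location"])["value"].mean()

print(means)
```

The result is indexed by the (weekday, location) pairs that actually occur in the data, here three combinations.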


<div class='alert alert-danger'>
    
    
As we are working with a very short time series in these examples, the analysis does not provide a long-term representative result!
    
    
</div>

Plot the typical $NO_2$ pattern during the day of our time series of all stations together. In other words, what is the average value for each hour of the day?

In [None]:
fig, ax = plt.subplots(figsize=(12, 4))

air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
    kind="bar", rot=0, ax=ax
)

# custom x label using matplotlib
ax.set_xlabel("Hour of the day")

ax.set_ylabel("$NO_2 (µg/m^3)$");

Similar to the previous case, we want to calculate a given statistic (e.g. mean $NO_2$ ) for each hour of the day and we can use the split-apply-combine approach again. For this case, we use the datetime property hour of pandas Timestamp, which is also accessible by the dt accessor.

### Datetime as index

In the tutorial on reshaping, pivot() was introduced to reshape the data table with each of the measurements locations as a separate column:

In [None]:
no_2 = air_quality.pivot(index="datetime", columns="location", values="value")

no_2.head()

<div class='alert alert-info'>

By pivoting the data, the datetime information became the index of the table. In general, setting a column as an index can be achieved by the set_index function.

</div>


Working with a datetime index (i.e. DatetimeIndex) provides powerful functionalities. For example, we do not need the dt accessor to get the time series properties, but have these properties available on the index directly:

In [None]:
no_2.index.year, no_2.index.weekday

Some other advantages are the convenient subsetting of time periods and the adapted time scale on plots. Let’s apply this to our data.

? Create a plot of the $NO_2$ values in the different stations from the 20th of May till the end of 21st of May

In [None]:
no_2["2019-05-20":"2019-05-21"].plot();
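
The slice in the cell above uses date strings as bounds. A minimal sketch on a synthetic hourly series (the series and its values are assumptions) shows that both bounds are inclusive and that a day-level string covers the whole day:

```python
import pandas as pd

# Four days of hourly data, starting at 2019-05-19 00:00.
index = pd.date_range("2019-05-19", periods=96, freq=pd.Timedelta(hours=1))
series = pd.Series(range(96), index=index)

# Day-level strings select everything from the start of the first day
# up to and including the last timestamp of the second day.
subset = series.loc["2019-05-20":"2019-05-21"]

print(len(subset))
print(subset.index.min(), subset.index.max())
```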

> By providing a string that parses to a datetime, a specific subset of the data can be selected on a DatetimeIndex.

### Resample a time series to another frequency

? Aggregate the current hourly time series values to the monthly maximum value in each of the stations.

In [None]:
monthly_max = no_2.resample("M").max()

monthly_max

> A very powerful feature of time series data with a datetime index is the ability to resample() the series to another frequency (e.g., converting secondly data into 5-minutely data).

The `.resample()` method is similar to a groupby operation:

- it provides a time-based grouping, by using a string (e.g. M, 5H,…) that defines the target frequency
- it requires an aggregation function such as mean, max,…
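
The similarity with groupby can be sketched on synthetic hourly data (the series `hourly` is an assumption): resampling to daily frequency gives the same values as grouping on the timestamps floored to the day.

```python
import pandas as pd

# Two days of hourly values 0..47.
index = pd.date_range("2019-05-07", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(range(48), index=index, dtype="float64")

# Time-based grouping defined by a frequency string, followed by an aggregation.
daily_max = hourly.resample("D").max()

# Roughly equivalent groupby: group on each timestamp floored to its day.
same_by_groupby = hourly.groupby(hourly.index.floor("D")).max()

print(daily_max)
print(same_by_groupby)
```

One difference worth keeping in mind is that resample produces a regular index at the target frequency, so periods without any data still appear in the result, whereas groupby only returns the groups that occur.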

When defined, the frequency of the time series is provided by the freq attribute:


In [None]:
monthly_max.index.freq

? Make a plot of the daily mean value in each of the stations.

In [None]:
no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));

<div class='alert alert-info'>


REMEMBER

- Valid date strings can be converted to datetime objects using to_datetime function or as part of read functions.
- Datetime objects in pandas support calculations, logical operations and convenient date-related properties using the dt accessor.
- A DatetimeIndex contains these date-related properties and supports convenient slicing.
- Resample is a powerful method to change the frequency of a time series.



</div>

## How to manipulate textual data?

Data used in this section: Titanic

In [None]:
titanic = pd.read_csv("data/titanic.csv")

titanic.head()

? Make all name characters lowercase.

In [None]:
titanic["Name"].str.lower()



> To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.

Similar to datetime objects in the time series tutorial having a dt accessor, a number of specialized string methods are available when using the str accessor. These methods have in general matching names with the equivalent built-in string methods for single elements, but are applied element-wise (remember element-wise calculations?) on each of the values of the columns.

? Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.

In [None]:
titanic["Name"].str.split(",")

Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element is the part after the comma.

In [None]:
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)

titanic["Surname"]

As we are only interested in the first part representing the surname (element 0), we can again use the str accessor and apply Series.str.get() to extract the relevant part. Indeed, these string functions can be chained to combine multiple functions at once!
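
A self-contained sketch of this chaining on three names taken from the start of the tutorial; the extra `str.strip()` step for the second part is an addition for illustration.

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Allen, Mr. William Henry",
        "Bonnell, Miss. Elizabeth",
    ]
)

# Each str call works element-wise and returns a new Series,
# so the calls can be chained one after another.
surname = names.str.split(",").str.get(0)
rest = names.str.split(",").str.get(1).str.strip()  # drop the leading space

print(surname.tolist())
print(rest.tolist())
```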

? Extract the passenger data about the countesses on board of the Titanic.

In [None]:
titanic["Name"].str.contains("Countess")

In [None]:
titanic[titanic["Name"].str.contains("Countess")]

> (Interested in her story? See [Wikipedia](https://fr.wikipedia.org/wiki/Lucy_No%C3%ABl_Leslie_Martha)!)
>
> The string method Series.str.contains() checks for each of the values in the column Name if the string contains the word Countess and returns for each of the values True (Countess is part of the name) or False (Countess is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing introduced in the subsetting of data tutorial. As there was only one countess on the Titanic, we get one row as a result.

<div class='alert alert-info'>
More powerful extractions on strings are supported, as the Series.str.contains() and Series.str.extract() methods accept regular expressions, but these are out of scope for this tutorial.
</div>
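
For readers who want a first taste anyway, here is a hedged sketch of both methods with regular expressions. The three names and the two patterns are assumptions chosen for illustration; the second pattern captures the text between the comma and the first period.

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
        "Bonnell, Miss. Elizabeth",
    ]
)

# contains with a regular expression: match either "Miss." or "Countess".
is_miss_or_countess = names.str.contains(r"Miss\.|Countess", regex=True)

# extract returns the first capture group; expand=False gives a Series.
title = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(is_miss_or_countess.tolist())
print(title.tolist())
```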

? Which passenger of the Titanic has the longest name?

In [None]:
titanic["Name"].str.len()

To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas string methods, the Series.str.len() function is applied to each of the names individually (element-wise).

In [None]:
titanic["Name"].str.len().idxmax()

Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str is used.

In [None]:
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

Based on the index name of the row (307) and the column (Name), we can do a selection using the loc operator, introduced in the tutorial on subsetting.
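
The distinction between an index label and a position matters here. The sketch below uses a made-up table whose index labels (10, 20, 30) differ from the row positions, so the two cannot be confused.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Ann", "Bartholomew", "Cy"]}, index=[10, 20, 30])

lengths = people["Name"].str.len()

label = lengths.idxmax()     # index label of the longest name
position = lengths.argmax()  # integer position of the longest name

# loc selects by label, so it pairs naturally with idxmax.
longest = people.loc[label, "Name"]

print(label, position, longest)
```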

? In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”.

In [None]:
titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

titanic["Sex_short"]

Although replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. It requires a dictionary to define the mapping {from : to}.


<div class='alert alert-warning'>
    
There is also a Series.str.replace() method available to replace a specific set of characters. However, with a mapping of multiple values, this would become:

`titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")`

`titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")`

This would become cumbersome and easily lead to mistakes. Just think (or try out yourself) what would happen if those two statements are applied in the opposite order…
    
</div>
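
The ordering problem can be reproduced on a three-element Series (made-up data). Because "female" contains the substring "male", replacing "male" first corrupts the female entries:

```python
import pandas as pd

sex = pd.Series(["male", "female", "female"])

# Dictionary-based replace matches whole values, so the order does not matter.
mapped = sex.replace({"male": "M", "female": "F"})

# Substring-based replace in the wrong order: "female" first becomes "feM",
# and the second call no longer finds "female".
wrong = sex.str.replace("male", "M").str.replace("female", "F")

print(mapped.tolist())
print(wrong.tolist())
```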

<div class='alert alert-success'>

REMEMBER

- String methods are available using the str accessor.
- String methods work element-wise and can be used for conditional indexing.
- The replace method is a convenient method to convert values according to a given dictionary.


</div>

## Additional resources


<div>
<section id="community-tutorials">
<span id="communitytutorials"></span><h1>Community tutorials</h1>
<p>This is a guide to many pandas tutorials by the community, geared mainly for new users.</p>
<section id="pandas-cookbook-by-julia-evans">
<h2>pandas cookbook by Julia Evans</h2>
<p>The goal of this 2015 cookbook (by <a class="reference external" href="https://jvns.ca">Julia Evans</a>) is to
give you some concrete examples for getting started with pandas. These
are examples with real-world data, and all the bugs and weirdness that
entails.
For the table of contents, see the <a class="reference external" href="https://github.com/jvns/pandas-cookbook">pandas-cookbook GitHub
repository</a>.</p>
</section>
<section id="pandas-workshop-by-stefanie-molin">
<h2>pandas workshop by Stefanie Molin</h2>
<p>An introductory workshop by <a class="reference external" href="https://github.com/stefmolin">Stefanie Molin</a>
designed to quickly get you up to speed with pandas using real-world datasets.
It covers getting started with pandas, data wrangling, and data visualization
(with some exposure to matplotlib and seaborn). The
<a class="reference external" href="https://github.com/stefmolin/pandas-workshop">pandas-workshop GitHub repository</a>
features detailed environment setup instructions (including a Binder environment),
slides and notebooks for following along, and exercises to practice the concepts.
There is also a lab with new exercises on a dataset not covered in the workshop for
additional practice.</p>
</section>
<section id="learn-pandas-by-hernan-rojas">
<h2>Learn pandas by Hernan Rojas</h2>
<p>A set of lessons for new pandas users: <a class="reference external" href="https://bitbucket.org/hrojas/learn-pandas">https://bitbucket.org/hrojas/learn-pandas</a></p>
</section>
<section id="practical-data-analysis-with-python">
<h2>Practical data analysis with Python</h2>
<p>This <a class="reference external" href="https://wavedatalab.github.io/datawithpython">guide</a> is an introduction to the data analysis process using the Python data ecosystem and an interesting open dataset.
There are four sections covering selected topics such as <a class="reference external" href="https://wavedatalab.github.io/datawithpython/munge.html">munging data</a>,
<a class="reference external" href="https://wavedatalab.github.io/datawithpython/aggregate.html">aggregating data</a>, <a class="reference external" href="https://wavedatalab.github.io/datawithpython/visualize.html">visualizing data</a>
and <a class="reference external" href="https://wavedatalab.github.io/datawithpython/timeseries.html">time series</a>.</p>
</section>
<section id="exercises-for-new-users">
<span id="tutorial-exercises-new-users"></span><h2>Exercises for new users</h2>
<p>Practice your skills with real data sets and exercises.
For more resources, please visit the main <a class="reference external" href="https://github.com/guipsamora/pandas_exercises">repository</a>.</p>
</section>
<section id="modern-pandas">
<span id="tutorial-modern"></span><h2>Modern pandas</h2>
<p>Tutorial series written in 2016 by
<a class="reference external" href="https://github.com/TomAugspurger">Tom Augspurger</a>.
The source may be found in the GitHub repository
<a class="reference external" href="https://github.com/TomAugspurger/effective-pandas">TomAugspurger/effective-pandas</a>.</p>
<ul class="simple">
<li><p><a class="reference external" href="https://tomaugspurger.github.io/modern-1-intro.html">Modern Pandas</a></p></li>
<li><p><a class="reference external" href="https://tomaugspurger.github.io/method-chaining.html">Method Chaining</a></p></li>
<li><p><a class="reference external" href="https://tomaugspurger.github.io/modern-3-indexes.html">Indexes</a></p></li>
<li><p><a class="reference external" href="https://tomaugspurger.github.io/modern-4-performance.html">Performance</a></p></li>
<li><p><a class="reference external" href="https://tomaugspurger.github.io/modern-5-tidy.html">Tidy Data</a></p></li>
<li><p><a class="reference external" href="https://tomaugspurger.github.io/modern-6-visualization.html">Visualization</a></p></li>
<li><p><a class="reference external" href="https://tomaugspurger.github.io/modern-7-timeseries.html">Timeseries</a></p></li>
</ul>
</section>
<section id="excel-charts-with-pandas-vincent-and-xlsxwriter">
<h2>Excel charts with pandas, vincent and xlsxwriter</h2>
<ul class="simple">
<li><p><a class="reference external" href="https://pandas-xlsxwriter-charts.readthedocs.io/">Using Pandas and XlsxWriter to create Excel charts</a></p></li>
</ul>
</section>
<section id="video-tutorials">
<h2>Video tutorials</h2>
<ul class="simple">
<li><p><a class="reference external" href="https://www.youtube.com/watch?v=5JnMutdy6Fw">Pandas From The Ground Up</a>
(2015) (2:24)
<a class="reference external" href="https://github.com/brandon-rhodes/pycon-pandas-tutorial">GitHub repo</a></p></li>
<li><p><a class="reference external" href="https://www.youtube.com/watch?v=-NR-ynQg0YM">Introduction Into Pandas</a>
(2016) (1:28)
<a class="reference external" href="https://github.com/chendaniely/2016-pydata-carolinas-pandas">GitHub repo</a></p></li>
<li><p><a class="reference external" href="https://www.youtube.com/watch?v=7vuO9QXDN50">Pandas: .head() to .tail()</a>
(2016) (1:26)
<a class="reference external" href="https://github.com/TomAugspurger/pydata-chi-h2t">GitHub repo</a></p></li>
<li><p><a class="reference external" href="https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y">Data analysis in Python with pandas</a>
(2016-2018)
<a class="reference external" href="https://github.com/justmarkham/pandas-videos">GitHub repo</a> and
<a class="reference external" href="https://nbviewer.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb">Jupyter Notebook</a></p></li>
<li><p><a class="reference external" href="https://www.youtube.com/playlist?list=PL5-da3qGB5IBITZj_dYSFqnd_15JgqwA6">Best practices with pandas</a>
(2018)
<a class="reference external" href="https://github.com/justmarkham/pycon-2018-tutorial">GitHub repo</a> and
<a class="reference external" href="https://nbviewer.org/github/justmarkham/pycon-2018-tutorial/blob/master/tutorial.ipynb">Jupyter Notebook</a></p></li>
</ul>
</section>
<section id="various-tutorials">
<h2>Various tutorials</h2>
<ul class="simple">
<li><p><a class="reference external" href="https://wesmckinney.com/archives.html">Wes McKinney’s (pandas BDFL) blog</a></p></li>
<li><p><a class="reference external" href="http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/">Statistical analysis made easy in Python with SciPy and pandas DataFrames, by Randal Olson</a></p></li>
<li><p><a class="reference external" href="https://conference.scipy.org/scipy2013/tutorial_detail.php?id=109">Statistical Data Analysis in Python, tutorial videos, by Christopher Fonnesbeck from SciPy 2013</a></p></li>
<li><p><a class="reference external" href="https://nbviewer.ipython.org/github/twiecki/financial-analysis-python-tutorial/blob/master/1.%20Pandas%20Basics.ipynb">Financial analysis in Python, by Thomas Wiecki</a></p></li>
<li><p><a class="reference external" href="http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/">Intro to pandas data structures, by Greg Reda</a></p></li>
<li><p><a class="reference external" href="https://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/">Pandas and Python: Top 10, by Manish Amde</a></p></li>
<li><p><a class="reference external" href="https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python">Pandas DataFrames Tutorial, by Karlijn Willems</a></p></li>
<li><p><a class="reference external" href="https://tutswiki.com/pandas-cookbook/chapter1/">A concise tutorial with real life examples</a></p></li>
</ul>
</section>
</section>
</div>

