# Introduction to pandas

pandas is a Python module for working with tabular (multidimensional) data, mainly in the form of **dataframes**. For extra tutorials [see here](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html).

We start by loading the module and creating a first dataframe filled with random values:


In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

Printing the dataframe with `print()` gives a less nicely formatted output:

In [None]:
print(df)

With the dataframe attribute `.dtypes` we can get the data type of each column in a pandas dataframe:

In [None]:
df.dtypes

The dataframe method `head()` will show the first few rows of the dataframe; the attribute `.columns` returns the column names.

In [None]:
df.head()

In [None]:
df.columns

With the method `to_numpy()` we convert a dataframe to an array; then we can slice it as we already saw for NumPy arrays.

In [None]:
arr = df.to_numpy()
print(arr)

In [None]:
arr[:,1] ## get second column
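
One detail worth keeping in mind: a NumPy array has a single data type, so a dataframe with mixed column types is converted to a generic `object` array. The sketch below uses a small made-up dataframe (the names `mixed`, `mixed_arr` and `numeric_only` are just for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical data: one numeric column, one text column
mixed = pd.DataFrame({'age': [12, 20, 88], 'name': ['Annah', 'Giorgia', 'Luca']})

# numbers and text cannot share a numeric dtype, so the array is of type object
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)

# selecting only the numeric column keeps a numeric dtype
numeric_only = mixed[['age']].to_numpy()
print(numeric_only.dtype)
print(numeric_only.shape)
```

If you plan to do maths on the array, it is usually simpler to select the numeric columns before converting.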

It is possible to create a dataframe from a standard Python dictionary, or from another dataframe.

In [None]:
my_dict = {
    'name' : ['Annah', 'Giorgia', 'Luca'],
    'surname' : ['Montana', 'Smoth', 'Guerri'],
    'age' : [12, 20, 88]
}
df_from_dict = pd.DataFrame(my_dict)

df_from_dict

In [None]:
df_from_df = pd.DataFrame(df_from_dict)
df_from_df

# Slicing, in several ways

You can also slice the pandas dataframe directly:

1. by column name

In [None]:
df['A']

2. by slicing by rows

In [None]:
df[0:2]

3a. by row and column labels using the `.loc` syntax

In [None]:
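#.loc slices by label and includes the end label: rows 0, 1, 2 and 3
print(df.loc[0:3, :])
#all rows, only columns B and C
df.loc[:, ['B','C']]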

3b. by row names using the `.loc` syntax (passing through an index)

(see the doc for [.set_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html) method for more details on setting indexes)

In [None]:
#adding a new column
df['Fruits'] = ['Apple', 'Banana', 'Coconut', 'Date', 'Elderberry', 'Fig', 'Grape', 'Juniper']

#setting the column as new index
df = df.set_index('Fruits')

#taking a look
df

In [None]:
#extracting two rows
df.loc[['Apple', 'Banana'], :]

4. by position using the `.iloc` syntax (this is similar to numpy slicing)

In [None]:
df.iloc[0:2,2:4]
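
The difference between `.loc` and `.iloc` is easy to trip over when slicing, so here is a minimal sketch on a made-up dataframe (`toy` and its labels are hypothetical). It shows that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as in NumPy:

```python
import pandas as pd

# hypothetical data with text labels as index
toy = pd.DataFrame(
    {'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]},
    index=['w', 'x', 'y', 'z'],
)

# .loc slices by label and INCLUDES the end label: rows w, x, y
by_label = toy.loc['w':'y', :]

# .iloc slices by position and EXCLUDES the end position: rows 0 and 1
by_position = toy.iloc[0:2, :]

print(by_label)
print(by_position)
```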

# Reading from a text file

It is possible to read a Pandas dataframe directly from a .csv file, either local or accessible from the web. We'll use the function [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) which in its simplest form looks like this:

In [None]:
FILE_URL = 'https://raw.githubusercontent.com/ne1s0n/dataviz_python/main/resources/planets.csv'
planets = pd.read_csv(FILE_URL)

planets

Taking a look at the types we can discover a few interesting things. What happened to Rotation and Axis tilt?

In [None]:
planets.dtypes
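
One common reason for a column to be read as `object` is that some of its cells contain text. The sketch below uses made-up values, not the actual content of the planets file, and shows how `pd.to_numeric()` with `errors='coerce'` can turn such a column into numbers:

```python
import io
import pandas as pd

# hypothetical csv content, not the real planets file
csv_text = "name,rotation\nalpha,1.5\nbeta,unknown\ngamma,-3.0\n"
toy = pd.read_csv(io.StringIO(csv_text))

# one non-numeric cell is enough to stop the column being read as numbers
print(toy.dtypes)

# errors='coerce' turns the cells that cannot be parsed into NaN
toy['rotation_num'] = pd.to_numeric(toy['rotation'], errors='coerce')
print(toy)
```

Whether coercing to NaN is appropriate depends on what the text cells mean, so it is worth looking at them before converting.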

---

# ASSIGNMENT (reading csv)

Take a look at the documentation for [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to read the data again, but this time you'll need to: 

* specify that the column with the names of the planets needs to be used as index, so that we can then invoke something like `planets.loc['Venus', :]` 
* find a way to not read the last line, since Pluto is not a planet anymore :(

---

In [None]:
# your solution here

## What about other formats?

There are several options, depending on what you need to read:

* spreadsheet (Excel, OpenOffice): use [read_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
* XML: use [read_xml()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_xml.html)
* JSON (JavaScript Object Notation): use [read_json()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)
* text, but fixed width: use [read_fwf()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html)
* ... and many more

In general pandas is very well suited to import table-like pieces of data, so before writing your own import function take a look at what's available.
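
As a quick illustration of two of the readers listed above, here is a sketch that parses small in-memory strings instead of real files. The data is made up, and `io.StringIO` is only used so that the example does not depend on any file on disk:

```python
import io
import pandas as pd

# hypothetical JSON: a list of records, one per row
json_text = '[{"name": "Earth", "moons": 1}, {"name": "Mars", "moons": 2}]'
from_json = pd.read_json(io.StringIO(json_text), orient='records')

# hypothetical fixed-width text: 7 characters for the name, 4 for the mass
fwf_text = "\n".join([
    "name   mass",
    "Earth  1.00",
    "Mars   0.11",
])
from_fwf = pd.read_fwf(io.StringIO(fwf_text), widths=[7, 4])

print(from_json)
print(from_fwf)
```

With real data you would pass a file path or URL in place of the `io.StringIO` object.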

# Doing statistics on dataframes

A large number of methods compute descriptive statistics on dataframes. Most of these are aggregations like [sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) and [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html), but some of them, like [cumsum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html), produce an object of the same size.

Generally speaking, these methods take an `axis` argument, easily specified by an integer (`axis=0` runs the operation down the rows, giving one result per column; `axis=1` runs it across the columns, giving one result per row).

In [None]:
# a dataframe of random integers
df = pd.DataFrame(np.random.randint(100, size = (10, 4)), columns=['A', 'B', 'C', 'D'])
df

In [None]:
#mean - default is by column
df.mean()

In [None]:
#mean - specifying by row
df.mean(axis = 1)

In [None]:
#can you guess what happens here?
df.mean(axis = 2)

In [None]:
#cumulative sum along the default axis
df.cumsum()
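
Since the dataframe above is random, it can be hard to check the `axis` behaviour by eye. Here is a small sketch with fixed, made-up values (`toy` is a hypothetical name) where the results are easy to verify by hand:

```python
import pandas as pd

# a tiny dataframe with known values: 2 rows, 3 columns
toy = pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': [100, 200]})

# axis=0: collapse the rows, one result per column
col_sums = toy.sum(axis=0)

# axis=1: collapse the columns, one result per row
row_sums = toy.sum(axis=1)

print(col_sums)
print(row_sums)
```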

--- 

# ASSIGNMENT! (pandas replacement)

* create a pandas dataframe with 5 rows, 15 columns, and filled with random numbers between 1 and 10
* search the Pandas dataframe documentation for a method to replace all values equal to 8 with the number 888

---

In [None]:
# your solution here

# Grouping

A very important pattern of operations when working with dataframes is grouping, also called "split-apply-combine". The idea is to split the dataframe rows into groups depending on the values of one or more columns, apply some statistical operation to each group, and combine the results. An example will clarify.

In [None]:
#the planets example
FILE_URL = 'https://raw.githubusercontent.com/ne1s0n/dataviz_python/main/resources/planets.csv'
planets = pd.read_csv(FILE_URL)

#focusing on the grouping column plus two numeric columns, for simplicity
my_df = planets[['Type', 'Mass (x ME)' , 'Diameter (km)']]

#group by Type, compute the average of everything else
my_df.groupby('Type').mean()

The last command did several things:

* it split the dataframe into groups, one for each value of the column "Type"
* it computed the required function (`.mean()`) on the split dataframe
* it recreated the index using the grouping column
* it returned a new dataframe. Notice that the original one is untouched
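
The steps listed above can also be written out by hand. The following is only a sketch on made-up data (the `toy` dataframe and its `kind` and `mass` columns are hypothetical stand-ins for the planets table), meant to show what `.groupby()` does for us behind the scenes:

```python
import pandas as pd

# hypothetical data, standing in for the planets table
toy = pd.DataFrame({
    'kind': ['rock', 'gas', 'rock', 'gas'],
    'mass': [1.0, 300.0, 3.0, 100.0],
})

# split: iterating over a groupby gives one sub-dataframe per value of 'kind'
# apply: compute the mean of 'mass' in each piece
pieces = {}
for kind, piece in toy.groupby('kind'):
    pieces[kind] = piece['mass'].mean()

# combine: collect the results, indexed by the grouping value
by_hand = pd.Series(pieces, name='mass')

# the one-liner gives the same numbers
one_liner = toy.groupby('kind')['mass'].mean()

print(by_hand)
print(one_liner)
```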

We could have avoided using the small `my_df` dataframe with a slightly more complex command:

In [None]:
#can you guess what's going on?
planets.groupby('Type')[['Mass (x ME)', 'Diameter (km)']].mean().reset_index()

It's possible to easily group by more than one column:

In [None]:
planets.groupby(['Type', 'Discovered'])[['Mass (x ME)', 'Diameter (km)']].mean().reset_index()
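
To see what `.reset_index()` contributes when grouping by several columns, here is a sketch on made-up data (the `toy` dataframe and its column names are hypothetical). Without it, the grouping keys end up in a two-level index; with it, they go back to being ordinary columns:

```python
import pandas as pd

# hypothetical data with two grouping columns
toy = pd.DataFrame({
    'kind': ['rock', 'rock', 'gas', 'gas'],
    'era': ['old', 'new', 'old', 'old'],
    'mass': [1.0, 3.0, 300.0, 100.0],
})

# grouping by two columns: the result is indexed by (kind, era) pairs
grouped = toy.groupby(['kind', 'era'])[['mass']].mean()
print(list(grouped.index.names))

# reset_index() moves the grouping keys back into ordinary columns
flat = grouped.reset_index()
print(flat)
```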

---

# ASSIGNMENT! (dataframe aggregate)

Count how many planets were discovered by Babylonian astronomers (spelled "Babilonian" in the data) using the `.groupby()` method

---

In [None]:
#your solution here

# Out of the box descriptions

There are a couple of methods that produce a quick description of a dataframe and can be useful to make sense of the data. They are [.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) and [.describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).



In [None]:
planets.info()

In [None]:
#aren't we missing something?
planets.describe()
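
By default `.describe()` summarises only the numeric columns. A minimal sketch on made-up data (the `toy` dataframe is hypothetical) showing the documented `include='all'` option, which adds the non-numeric columns to the summary:

```python
import pandas as pd

# hypothetical data: one text column, one numeric column
toy = pd.DataFrame({
    'name': ['Earth', 'Mars', 'Venus'],
    'moons': [1, 2, 0],
})

# default: only the numeric column is summarised
numeric_summary = toy.describe()
print(numeric_summary)

# include='all': text columns get count, unique, top and freq
full_summary = toy.describe(include='all')
print(full_summary)
```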

# Miscellanea

Stuff that students asked

## Substituting values in a slice

In [None]:
import pandas as pd
FILE_URL = 'https://raw.githubusercontent.com/ne1s0n/dataviz_python/main/resources/planets.csv'
planets = pd.read_csv(FILE_URL)

#a selector for some rows
sel = planets.loc[:, 'Discovered'] == "Babilonian astronomers, 2nd millennium BC"

#replacing the Moons column for Bab planets with a fixed value
planets.loc[sel, 'Moons'] = -10
planets

In [None]:
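#replacing the Moons column for Bab planets with a value from another column
#(Type holds text, so we first make Moons a generic object column: assigning
#text into a numeric column is deprecated in recent pandas versions)
planets['Moons'] = planets['Moons'].astype(object)
planets.loc[sel, 'Moons'] = planets.loc[sel, 'Type']
planets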
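
The same pattern on a small made-up dataframe, where the result can be checked by eye. This is only a sketch: `toy` and its columns are hypothetical, and the replacement values have the same type as the column they go into, so no type change is involved:

```python
import pandas as pd

# hypothetical data
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'group': ['x', 'y', 'x', 'y'],
    'value': [1, 2, 3, 4],
    'backup': [10, 20, 30, 40],
})

# boolean selector: True for the rows we want to change
sel = toy['group'] == 'x'

# fixed value for the selected rows
toy.loc[sel, 'value'] = 0

# values taken from another column, aligned row by row on the index
toy.loc[~sel, 'value'] = toy.loc[~sel, 'backup']

print(toy)
```

Doing the selection and the assignment in a single `.loc[rows, column]` step matters: a chained form such as `toy[sel]['value'] = 0` may modify a temporary copy and leave `toy` unchanged.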

## Precision when printing (number of digits)

In [None]:
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html
pd.set_option("display.precision", 5)
planets.describe()

| | Mass (x ME) | Diameter (km) | Density (g/cm3) | Magnetic field |
|---|---|---|---|---|
| count | 9.0 | 9.0 | 9.0 | 9.0 |
| mean | 49.62794 | 43903.77778 | 2.98811 | 2131.08893 |
| std | 105.12596 | 51415.39847 | 2.01337 | 6166.32043 |
| min | 0.0022 | 2376.0 | 0.687 | 0.0 |
| 25% | 0.107 | 6779.0 | 1.326 | 0.0 |
| 50% | 1.0 | 12742.0 | 1.854 | 1.0 |
| 75% | 17.147 | 50724.0 | 5.243 | 47.0 |
| max | 317.83 | 139822.0 | 5.515 | 18568.0 |
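
`pd.set_option()` changes the precision for the rest of the session. If you only need it for one output, `pd.option_context()` applies a setting temporarily. A minimal sketch with made-up values (the variable names are just for illustration):

```python
import pandas as pd

values = pd.Series([3.14159265, 2.71828183])

# remember the current setting, to compare later
before = pd.get_option('display.precision')

# inside the block, floats are displayed with 2 decimal digits
with pd.option_context('display.precision', 2):
    inside = pd.get_option('display.precision')
    print(values)

# outside the block the previous setting is back in place
after = pd.get_option('display.precision')
print(before, inside, after)
```

Note that this only affects how numbers are displayed: the values stored in the dataframe are not rounded.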


## Cheat sheet

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf