In [1]:
import pandas as pd

# What kind of data does pandas handle?

I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers), and sex (male/female).

To store data manually in a table, create a DataFrame. When it is built from a Python dictionary of lists, the dictionary keys become the column headers and each list supplies the values of one column.

In [10]:
table = pd.DataFrame({
    "First Name": ['Paul', 'Flora', 'Charles'],
    "Last Name": ['Owe', 'Owe', 'Owe'],
    "Age": [23, 55, 62],
    "Gender": ['M', 'F', 'M'],
    "NumericalValue": [4, 5, 7],
})

In [11]:
table

  First Name Last Name  Age Gender  NumericalValue
0       Paul       Owe   23      M               4
1      Flora       Owe   55      F               5
2    Charles       Owe   62      M               7


A DataFrame is a 2D data structure that can store different types (including characters, integers, floating point values, categorical data, and more) in columns. It is similar to a spreadsheet.

- The table has 5 columns, each with a label.
- The columns contain varying data types.
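
As a minimal sketch of how the two points above can be checked, the cell below rebuilds the same table and inspects its shape, column labels, and per-column data types. The variable names `n_rows`, `n_cols`, `column_labels`, and `column_dtypes` are illustrative only. The dtype reported for the text columns depends on the pandas version, so no output is shown for it here.

```python
import pandas as pd

# Rebuild the small passenger table from the cell above.
table = pd.DataFrame({
    "First Name": ["Paul", "Flora", "Charles"],
    "Last Name": ["Owe", "Owe", "Owe"],
    "Age": [23, 55, 62],
    "Gender": ["M", "F", "M"],
    "NumericalValue": [4, 5, 7],
})

n_rows, n_cols = table.shape          # (number of rows, number of columns)
column_labels = list(table.columns)   # the column headers, in order
column_dtypes = table.dtypes          # one dtype per column

print(n_rows, n_cols)
print(column_labels)
print(column_dtypes)
```

The integer lists become integer columns, while the name and gender columns are stored with a text-capable dtype.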

# Each column in a DataFrame is a Series
Selecting a single column, such as 'Age', returns a pandas Series.

A pandas.Series is a 1D labeled array with
- no column labels, since it is just a single column of a DataFrame
- row labels (axis labels, also called the index)
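
A Series can also be created directly, which makes its parts easier to see. The sketch below is only an illustration: the row labels `"paul"`, `"flora"`, and `"charles"` and the variable name `ages` are made up for this example and do not come from the table above.

```python
import pandas as pd

# Hypothetical row labels, chosen only for illustration.
ages = pd.Series([23, 55, 62], index=["paul", "flora", "charles"], name="Age")

print(ages.index)      # the row labels
print(ages.name)       # a single name, in place of column labels
print(ages.ndim)       # a Series has exactly one dimension
print(ages["flora"])   # values can be looked up by row label
```

When a Series is taken from a DataFrame column, its name is the column header and its index is the DataFrame's row index.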

In [6]:
age = table['Age']
age

0    23
1    55
2    62
Name: Age, dtype: int64

### Doing something with a DataFrame or Series

1. Numerical data columns

In [7]:
age.max()

62

In [8]:
table['Age'].max()

62

In [31]:
age.describe()

count     3.000000
mean     46.666667
std      20.792627
min      23.000000
25%      39.000000
50%      55.000000
75%      58.500000
max      62.000000
Name: Age, dtype: float64

### Basic statistics of the numerical data of my data table

The describe() method provides a quick overview of the numerical data in a DataFrame. Because the First Name, Last Name, and Gender columns hold textual data, describe() leaves them out by default.

The describe() method is an example of a pandas operation that returns a pandas Series (when called on a Series) or a pandas DataFrame (when called on a DataFrame).

To include summaries of the textual and categorical columns as well, pass include='all'.


In [18]:
table.describe(include='all')

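       First Name Last Name        Age Gender  NumericalValue
count           3         3   3.000000      3        3.000000
unique          3         1        NaN      2             NaN
top          Paul       Owe        NaN      M             NaN
freq            1         3        NaN      2             NaN
mean          NaN       NaN  46.666667    NaN        5.333333
std           NaN       NaN  20.792627    NaN        1.527525
min           NaN       NaN  23.000000    NaN        4.000000
25%           NaN       NaN  39.000000    NaN        4.500000
50%           NaN       NaN  55.000000    NaN        5.000000
75%           NaN       NaN  58.500000    NaN        6.000000
max           NaN       NaN  62.000000    NaN        7.000000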
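
The following sketch, which rebuilds the same table, shows what describe() hands back in each case: a Series when called on a single column and a DataFrame when called on the whole table. The names `age_summary` and `table_summary` are illustrative only.

```python
import pandas as pd

table = pd.DataFrame({
    "First Name": ["Paul", "Flora", "Charles"],
    "Last Name": ["Owe", "Owe", "Owe"],
    "Age": [23, 55, 62],
    "Gender": ["M", "F", "M"],
    "NumericalValue": [4, 5, 7],
})

age_summary = table["Age"].describe()   # called on a Series
table_summary = table.describe()        # called on a DataFrame, default settings

print(type(age_summary).__name__)
print(type(table_summary).__name__)
print(list(table_summary.columns))      # only the numerical columns remain
print(age_summary["mean"])              # individual statistics are looked up by label
```

Because the result is itself a Series or DataFrame, single statistics can be pulled out by label, as with `age_summary["mean"]`.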


### Index of min/max values

In [30]:
import numpy as np
series = pd.Series(np.random.rand(5))  # five random values; a Series is always 1D
series

0    0.482802
1    0.955417
2    0.160979
3    0.872252
4    0.025665
dtype: float64

In [32]:
series.idxmin()  # returns the index label of the minimum value in this Series

4

In [33]:
series.idxmax()  # returns the index label of the maximum value

1

In [35]:
minmaxtup = series.idxmin(), series.idxmax()  # the comma packs both labels into a tuple

In [40]:
type(minmaxtup)  # check the data structure type

tuple
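
With the default index, the label returned by idxmin() or idxmax() happens to equal the row position. The sketch below uses a Series with made-up string labels (the names in the index and the variable names are assumptions for illustration) to show that idxmax() returns the label, while argmax() returns the integer position.

```python
import pandas as pd

# Hypothetical labelled Series, used only to separate labels from positions.
ages = pd.Series([23, 55, 62], index=["Paul", "Flora", "Charles"], name="Age")

oldest_label = ages.idxmax()      # index label of the largest value
oldest_position = ages.argmax()   # integer position of the largest value
oldest_age = ages.loc[oldest_label]

print(oldest_label, oldest_position, oldest_age)
```

The returned label can be passed straight to `.loc` to fetch the matching value or, on a DataFrame, the matching row.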