# Data Science and Visualization (RUC F2024)

## Lecture 2: Exploratory Data Analysis (EDA)

# DataFrame in pandas

A ***DataFrame*** is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). More details can be found at https://pandas.pydata.org/pandas-docs/version/0.24/reference/frame.html

## 1. Construction and content

In [2]:
import pandas as pd

We can construct a DataFrame by specifying the **values**, **index** and **columns**. Both index and columns are optional; if omitted, they default to integer labels starting at 0.

Note that the values are organized as a list of lists in the following example, one inner list per row.

In [3]:
df_0 = pd.DataFrame([[1, 2], 
                   [4, 5], 
                   [7, 8]])
df_0

,0,1
0,1,2
1,4,5
2,7,8


Below we specify values and columns.

In [4]:
df = pd.DataFrame([[1, 2], 
                   [4, 5], 
                   [7, 8]],
                  columns=['max_speed', 'shield'])
df

,max_speed,shield
0,1,2
1,4,5
2,7,8


We may specify values, index and columns.

In [4]:
df = pd.DataFrame([[1, 2], 
                   [4, 5], 
                   [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
df

,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


Also, we can construct a DataFrame from a dict of *equal-length* lists.

In the following example, the dict *data* defines the columns and their values: each key becomes a column name, and the corresponding list holds that column's values. The index defaults to the integers 0 to 5.

In [5]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2
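
The *equal-length* requirement can be checked with a small sketch. The dict `bad_data` below is a hypothetical example (not part of the lecture data) whose lists differ in length; pandas is expected to reject it with a `ValueError`.

```python
import pandas as pd

# Hypothetical dict whose lists differ in length (2 values vs. 3 values).
bad_data = {'state': ['Ohio', 'Nevada'],
            'year': [2000, 2001, 2002]}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length.
    raised = True
    print('ValueError:', err)
```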


Furthermore, we can change the column ordering when we construct a DataFrame from the same dict object.

In [6]:
frame_2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
frame_2

,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


As with Series, we can supply our own index labels.
And if we pass a column that isn’t contained in the dict, it will appear with missing values in the result:

In [7]:
frame_2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                       index=['one', 'two', 'three', 'four', 'five', 'six'])
frame_2

,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,
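
The empty 'debt' cells above are missing values (NaN). As a minimal sketch, assuming a smaller hypothetical frame named `frame_demo`, we can confirm this with `isna()` and then fill the column by assigning a value to it:

```python
import pandas as pd

# Hypothetical two-row version of the data used above.
small_data = {'state': ['Ohio', 'Nevada'], 'year': [2000, 2001]}
frame_demo = pd.DataFrame(small_data,
                          columns=['year', 'state', 'debt'],
                          index=['one', 'two'])

# 'debt' is not a key of small_data, so every value in it is missing.
all_missing = bool(frame_demo['debt'].isna().all())
print(all_missing)

# Assigning a scalar fills the whole column with that value.
frame_demo['debt'] = 0.0
print(frame_demo)
```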


## 2. To retrieve the content of a DataFrame

### Column retrieval

A column in a DataFrame can be retrieved as a *Series* object either by dict-like notation or by attribute:

In [8]:
df.shield

cobra         2
viper         5
sidewinder    8
Name: shield, dtype: int64

In [9]:
df['shield']

cobra         2
viper         5
sidewinder    8
Name: shield, dtype: int64
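
The two notations are not always interchangeable. Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a hypothetical frame `frame_demo` with a column named 'pop', which clashes with the method `DataFrame.pop()`; this is one reason to prefer the dict-like notation.

```python
import pandas as pd

# Hypothetical frame with a column whose name clashes with DataFrame.pop().
frame_demo = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

by_key = frame_demo['pop']   # dict-like notation: returns the column as a Series
by_attr = frame_demo.pop     # attribute notation: returns the bound method, not the column

print(type(by_key).__name__)
print(callable(by_attr))
```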

To get columns in a DataFrame format, we put the column name(s) in a list:

In [10]:
df[['max_speed']]

,max_speed
cobra,1
viper,4
sidewinder,7


In [11]:
df[['max_speed', 'shield']]

,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


### Row retrieval

However, we will get errors if we try to get rows in the same way as column retrieval:

In [12]:
df.cobra

AttributeError: 'DataFrame' object has no attribute 'cobra'

In [13]:
df['cobra']

KeyError: 'cobra'

In [14]:
df[['cobra', 'viper']]

KeyError: "None of [Index(['cobra', 'viper'], dtype='object')] are in the [columns]"

To get rows in a DataFrame, we need to use the **loc** indexer (written with square brackets, not called like a function). A single row is returned as a Series.

In [15]:
df.loc['viper']

max_speed    4
shield       5
Name: viper, dtype: int64

In [16]:
type(df.loc['viper'])

pandas.core.series.Series

If we want to return one or more rows as a *DataFrame*, we should use a list of index elements as the argument:

In [17]:
df.loc[['viper']]

,max_speed,shield
viper,4,5


Note the columns are also displayed (returned as a part of the retrieved result).

In [18]:
df.loc[['viper', 'sidewinder']]

,max_speed,shield
viper,4,5
sidewinder,7,8


The following won't work, because with two comma-separated arguments loc treats the second one as a column label:

In [19]:
df.loc['viper', 'sidewinder']

KeyError: 'sidewinder'

We can also specify a range of index labels for row retrieval:

In [20]:
df.loc['cobra':'sidewinder']

,max_speed,shield
cobra,1,2
viper,4,5
sidewinder,7,8


With such a label range, if the start label comes after the end label in the index, we get an empty DataFrame.

In [21]:
df.loc['sidewinder':'cobra']

,max_speed,shield



We can get the specific value of a cell by specifying the index (for row) and the column of the cell:

In [22]:
df.loc['cobra', 'shield']

2

By using arguments for both axes (index and columns), we can do **slicing**, i.e., retrieve the values of specific rows and columns:

In [23]:
df.loc['cobra':'viper', 'max_speed']

cobra    1
viper    4
Name: max_speed, dtype: int64

Compare the examples above and below. Passing the column as a list (*['max_speed']*) returns a DataFrame rather than a Series, which is why the column name appears as a header below.

In [24]:
df.loc['cobra':'viper', ['max_speed']]

,max_speed
cobra,1
viper,4


In [25]:
df.loc['cobra':'viper', ['max_speed', 'shield']]

,max_speed,shield
cobra,1,2
viper,4,5


If you prefer to use integer positions for selecting rows and columns, you can use the **iloc** indexer. Note the difference in slicing: a label slice such as *'cobra':'viper'* in loc includes the end point, whereas an integer slice such as *0:2* in iloc excludes it.

In [26]:
df.iloc[0:2, [0, 1]]

,max_speed,shield
cobra,1,2
viper,4,5
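
The inclusive/exclusive difference between loc and iloc slices can be verified directly. The following sketch rebuilds the small frame from above under the hypothetical name `df_demo` and compares the number of rows each slice returns.

```python
import pandas as pd

# Same values as the 'cobra/viper/sidewinder' frame, under a hypothetical name.
df_demo = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                       index=['cobra', 'viper', 'sidewinder'],
                       columns=['max_speed', 'shield'])

by_label = df_demo.loc['cobra':'viper']  # label slice: end label is included
by_pos_1 = df_demo.iloc[0:1]             # position slice: stops before position 1
by_pos_2 = df_demo.iloc[0:2]             # position slice: stops before position 2

print(len(by_label), len(by_pos_1), len(by_pos_2))
```

So `loc['cobra':'viper']` corresponds to `iloc[0:2]`, not to `iloc[0:1]`.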


More slicing examples are given below. Recall how we do slicing for sequences in Python. It's similar here.

In [27]:
frame_2

,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [28]:
frame_2.iloc[:, :3]

,year,state,pop
one,2000,Ohio,1.5
two,2001,Ohio,1.7
three,2002,Ohio,3.6
four,2001,Nevada,2.4
five,2002,Nevada,2.9
six,2003,Nevada,3.2


We can combine the slicing with some conditions to filter out rows that we don't want to see.

The following example retrieves all rows whose population is higher than 2 million.

In [29]:
frame_2.iloc[:, :3][(frame_2['pop'] > 2)]

,year,state,pop
three,2002,Ohio,3.6
four,2001,Nevada,2.4
five,2002,Nevada,2.9
six,2003,Nevada,3.2


*frame_2['pop'] > 2* returns a Series of boolean values, which can be used to filter out or select rows in a DataFrame.

In [30]:
frame_2['pop'] > 2

one      False
two      False
three     True
four      True
five      True
six       True
Name: pop, dtype: bool

The filtering condition can be composite.

In [31]:
frame_2.iloc[:, :3][(frame_2['pop'] > 2) & (frame_2['year']==2002)] #.count()

,year,state,pop
three,2002,Ohio,3.6
five,2002,Nevada,2.9
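
As an alternative sketch, the row filter and the column selection can be written as a single loc call, with the boolean Series as the row argument. The name `frame_demo` is a hypothetical copy of the data above. Conditions are combined with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical copy of the state/year/pop data used above.
frame_demo = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2001, 2002, 2003],
     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]},
    index=['one', 'two', 'three', 'four', 'five', 'six'])

# Both conditions must hold (and).
mask = (frame_demo['pop'] > 2) & (frame_demo['year'] == 2002)
both = frame_demo.loc[mask, ['year', 'state', 'pop']]

# At least one condition must hold (or).
either = frame_demo.loc[(frame_demo['pop'] > 3) | (frame_demo['year'] == 2000)]

# Negation (not): all rows that are not in Ohio.
not_ohio = frame_demo.loc[~(frame_demo['state'] == 'Ohio')]

print(both)
```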


## 3. Transform

Given a DataFrame, we can transform its data by applying an operation to all its values.

In [32]:
df = pd.DataFrame({'A': range(3), 'B': range(4, 7)})
df

,A,B
0,0,4
1,1,5
2,2,6


In [33]:
# Here, lambda defines an 'inline' function that increases a given value by 1
df.transform(lambda x: x + 1)

,A,B
0,1,5
1,2,6
2,3,7


In [34]:
import numpy as np

df.transform(np.sqrt)

,A,B
0,0.0,2.0
1,1.0,2.236068
2,1.414214,2.44949
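
The result of transform() always has the same shape as the input. The sketch below, which uses the hypothetical name `df_demo` for the same small frame, checks that the lambda version matches plain arithmetic on the frame, and then applies a function written to operate on a whole column (subtracting that column's mean).

```python
import pandas as pd

# Same values as the A/B frame above, under a hypothetical name.
df_demo = pd.DataFrame({'A': range(3), 'B': range(4, 7)})

# For simple arithmetic, transform gives the same values as operating on the frame directly.
plus_one = df_demo.transform(lambda x: x + 1)
same_values = bool((plus_one == df_demo + 1).all().all())

# A function written for a whole column: center each column on its own mean.
centered = df_demo.transform(lambda col: col - col.mean())

print(same_values)
print(centered)
```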


## 4. Basic statistics

* **shape**: (#rows, #columns)
* **info()**: data type of each column
* **describe()**: statistics of all numeric columns
    * count, mean, std
    * min, 25%, 50%, 75%, max (percentiles)
* **nlargest()** and **nsmallest()** of a column
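
As a brief reminder of the first three items, here is a minimal sketch on a hypothetical copy of the state/year/pop data named `frame_demo`. Note that `shape` is an attribute (no parentheses), while `info()` and `describe()` are methods.

```python
import pandas as pd

# Hypothetical copy of the state/year/pop data used in this lecture.
frame_demo = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002, 2003],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]})

print(frame_demo.shape)           # (number of rows, number of columns)
frame_demo.info()                 # prints column names, non-null counts and dtypes

summary = frame_demo.describe()   # statistics for the numeric columns only
print(summary)
mean_pop = summary.loc['mean', 'pop']
```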

We've seen all except **nlargest()** and **nsmallest()**. By default, they return the **5** largest and the 5 smallest values of a column (together with their index labels), in *descending* and *ascending* order, respectively.


In [35]:
frame_2['pop'].nlargest()

three    3.6
six      3.2
five     2.9
four     2.4
two      1.7
Name: pop, dtype: float64

In [36]:
frame_2['pop'].nsmallest()

one     1.5
two     1.7
four    2.4
five    2.9
six     3.2
Name: pop, dtype: float64

In [37]:
frame_2['pop'].nlargest(3)

three    3.6
six      3.2
five     2.9
Name: pop, dtype: float64

In [38]:
frame_2['pop'].nsmallest(2)

one    1.5
two    1.7
Name: pop, dtype: float64

In [39]:
frame_2['year'].nsmallest(1)

one    2000
Name: year, dtype: int64

In [40]:
frame_2['year'].nlargest(1)

six    2003
Name: year, dtype: int64

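*NB*: When called with only a count, as above, nlargest() and nsmallest() work on a single column (a Series). A multi-column DataFrame also needs to be told which column(s) to rank by, so the same call fails:

In [41]:
frame_2[['pop', 'year']].nlargest(1)

TypeError: nlargest() missing 1 required positional argument: 'columns'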
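
A DataFrame has its own nlargest() and nsmallest() methods, which take the column(s) to rank by as a second argument and return whole rows. A minimal sketch, using a hypothetical copy of the data named `frame_demo`:

```python
import pandas as pd

# Hypothetical copy of the labeled year/state/pop data used above.
frame_demo = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2001, 2002, 2003],
     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]},
    index=['one', 'two', 'three', 'four', 'five', 'six'])

# The two rows with the largest 'pop' values; all columns are returned.
top_pop = frame_demo.nlargest(2, 'pop')

# Rank by 'year' first; 'pop' is only used to break ties.
latest = frame_demo.nlargest(1, ['year', 'pop'])

print(top_pop)
print(latest)
```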

## 5. Operation groupby

The DataFrame method **groupby(column)** splits the data into groups according to the distinct values of the given column, which should usually be categorical.

It is normally followed by an *aggregate operation*, typically:
* **count()**: applicable to categorical and numeric columns.
* **max()**: applicable to categorical and numeric columns. 
* **min()**: applicable to categorical and numeric columns.
* **mean()**: applicable to numeric columns only.
* **sum()**: applicable to numeric columns only.

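By default, such an operation is applied to **all columns** other than the grouping column(s). Older pandas versions silently drop columns that the operation cannot handle (such as 'state' for mean()); recent versions raise an error for such columns unless you pass *numeric_only=True*.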

In [42]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
        'debt': [2, 3, 4, 2, 4, 6]
       }
df_2 = pd.DataFrame(data)
df_2

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,2
1,Ohio,2001,1.7,3
2,Ohio,2002,3.6,4
3,Nevada,2001,2.4,2
4,Nevada,2002,2.9,4
5,Nevada,2003,3.2,6


In [43]:
df_2.groupby(['year']).mean(numeric_only=True)

year,pop,debt
2000,1.5,2.0
2001,2.05,2.5
2002,3.25,4.0
2003,3.2,6.0


count() will also apply to categorical columns:

In [44]:
df_2.groupby(['year']).count()

year,state,pop,debt
2000,1,1,1
2001,2,2,2
2002,2,2,2
2003,1,1,1


We may group 'debt' values by 'year' and get the mean for each year:

In [45]:
df_2.groupby(['year'])['debt'].mean()

year
2000    2.0
2001    2.5
2002    4.0
2003    6.0
Name: debt, dtype: float64

In [46]:
df_2.groupby(['year'])['pop'].max()

year
2000    1.5
2001    2.4
2002    3.6
2003    3.2
Name: pop, dtype: float64
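
Several aggregate operations can also be computed in one pass with agg(), which accepts a list of operation names. This is a sketch on a hypothetical copy of the data named `df_demo`, not a replacement for the individual calls above.

```python
import pandas as pd

# Hypothetical copy of the state/year/pop/debt data used above.
df_demo = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002, 2003],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
     'debt': [2, 3, 4, 2, 4, 6]})

# One row per state, one column per aggregate of 'pop'.
pop_stats = df_demo.groupby('state')['pop'].agg(['min', 'max', 'mean'])
print(pop_stats)
```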

The following two examples group by two columns at once; the order of the keys determines how the result's index is nested:

In [47]:
df_2.groupby(['state', 'year']).sum()

state,year,pop,debt
Nevada,2001,2.4,2
Nevada,2002,2.9,4
Nevada,2003,3.2,6
Ohio,2000,1.5,2
Ohio,2001,1.7,3
Ohio,2002,3.6,4


In [48]:
df_2.groupby(['year', 'state']).sum()

year,state,pop,debt
2000,Ohio,1.5,2
2001,Nevada,2.4,2
2001,Ohio,1.7,3
2002,Nevada,2.9,4
2002,Ohio,3.6,4
2003,Nevada,3.2,6
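
Grouping by two columns yields a result whose index has two levels. The sketch below, on a hypothetical copy of the data named `df_demo`, shows how a single entry can be looked up with a tuple of labels, and how reset_index() turns the index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical copy of the state/year/pop/debt data used above.
df_demo = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002, 2003],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
     'debt': [2, 3, 4, 2, 4, 6]})

nested = df_demo.groupby(['state', 'year']).sum()

# Look up one cell: row label is the (state, year) pair, column is 'debt'.
ohio_2001_debt = nested.loc[('Ohio', 2001), 'debt']
print(ohio_2001_debt)

# Turn the two index levels back into regular columns.
flat = nested.reset_index()
print(flat)
```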


We can also group by a numeric column, although this often doesn't make much sense:

In [49]:
df_2.groupby(['debt']).sum(numeric_only=True)

debt,year,pop
2,4001,3.9
3,2001,1.7
4,4004,6.5
6,2003,3.2
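
One way to make grouping on a numeric column more meaningful is to first bin its values into a few categories, for example with pd.cut(). The sketch below is only an illustration: the bin edges and the labels 'low' and 'high' are arbitrary assumptions, and `df_demo` is a hypothetical copy of the data.

```python
import pandas as pd

# Hypothetical copy of the state/year/pop/debt data used above.
df_demo = pd.DataFrame(
    {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
     'year': [2000, 2001, 2002, 2001, 2002, 2003],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
     'debt': [2, 3, 4, 2, 4, 6]})

# Assumed bins: debt in (0, 3] is 'low', debt in (3, 6] is 'high'.
debt_level = pd.cut(df_demo['debt'], bins=[0, 3, 6], labels=['low', 'high'])

# Group by the binned values instead of the raw numbers.
mean_pop = df_demo.groupby(debt_level, observed=True)['pop'].mean()
print(mean_pop)
```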
