# Pandas Introduction - Part 1

### Agenda

Part 1
* Intro
    * Background
    * Creating dataframes and series
    * Displaying dataframes
* Moving data - imports & exports
* Selecting data
* Data types
* Modifying data
* Sorting data
* Math operations
* Descriptive statistics and profiling data

Reference material
* https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
* https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
* https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-39e811c81a0c
* https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-part-3-d5704b4b9116
* https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-part-4-c4216f84d388
* https://tomaugspurger.github.io/modern-1-intro.html


# Introducing Pandas and the DataFrame
* Background
* initializing with pd.DataFrame() and pd.Series()
* df.head(), df.tail()
* DataFrame from Python list or dictionary
* pd.read_csv()
* df.to_csv()

### What is `pandas`
`pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

### Pandas DataFrame
    # a dataframe is basically a 2-dimensional table that can be composed of multiple data types
    # dataframes have an "index" property, which is a built-in logical key
        # the "index" can be useful for joins and plots
        # suggestion: don't worry too much about the index when starting out
        # IMO: sometimes pandas tries to be too smart with the index, which can lead to unpredictability/confusion

### Pandas Series
    # one-dimensional table (single column)
    # some pandas functionality only works or makes sense in a 1-d context


**DataFrame and Series examples**
![image.png](https://geo-python.github.io/site/_images/pandas-structures-annotated.png)

**DataFrame Terms**
![image.png](https://miro.medium.com/max/2992/1*aJJjpCHMNVyfM3UIJZyYcw.png)

### Pandas/NumPy background and use cases
* pandas dataframes are made up of blocks of numpy arrays. This allows for data sets with many types.
    * typically, a numpy array holds elements of a single type
    * ex: an array of floats only, an array of ints only, etc.
* tl;dr: numpy and pandas are faster than standard Python for numeric computing because they minimize the use of pointer/reference-based data types and explicit loops
    * numpy stores array elements next to each other in RAM
        * Python lists store pointer references, which come with some overhead
    * in numpy "vectorized" operations, loops are pushed down to C level, which is faster than a normal Python loop
* In-memory (RAM-based) computation is generally faster than reading from disk
    * however, RAM is limited to "medium" data sizes (ex: 16-32gb RAM PC)
    * SQL databases are heavily optimized for large data sets, so a common pattern is:
        * first aggregate or limit "big" data in SQL or another database
        * then perform analytical manipulations in a language like Python/R
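
As a rough illustration of the vectorization point above, the sketch below computes the same total with an explicit Python loop and with a numpy expression. The data and variable names are made up for illustration, and timing is left out because it depends on the machine.

```python
import numpy as np

# illustrative data (not one of the workshop data sets)
values = list(range(1_000))

# explicit Python loop: one interpreted iteration per element
loop_total = 0
for v in values:
    loop_total += v * 2

# vectorized numpy version: the loop runs inside compiled code
arr = np.arange(1_000)
vector_total = int((arr * 2).sum())

print(loop_total, vector_total)
```

Both approaches give the same answer; the difference is where the looping happens.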

![image.png](https://www.dataquest.io/wp-content/uploads/2019/01/df_blocks.png)

In [1]:
import os
import pandas as pd
import seaborn as sns

import utils
utils.prep_example_data()

Getting CSV files from seaborn in ./data/ directory






./data/anscombe.csv
./data/attention.csv
./data/brain_networks.csv
./data/car_crashes.csv
./data/diamonds.csv
./data/dots.csv
./data/exercise.csv
./data/flights.csv
./data/fmri.csv
./data/gammas.csv
./data/iris.csv
./data/mpg.csv
./data/planets.csv
./data/tips.csv
./data/titanic.csv


In [2]:
# list contents
print(os.listdir())
print('\n')
print(os.listdir('data'))


['.ipynb_checkpoints', 'data', 'oop.py', 'pandas-intro-part-1.ipynb', 'pandas-intro-part-2.ipynb', 'ReadMe.md', 'solutions', 'utils.py', '__pycache__']


['anscombe.csv', 'attention.csv', 'brain_networks.csv', 'car_crashes.csv', 'diamonds.csv', 'Divvy_Trips_2019_Q1.csv', 'Divvy_Trips_2019_Q2.csv', 'Divvy_Trips_2019_Q3.csv', 'Divvy_Trips_2019_Q4.csv', 'dots.csv', 'exercise.csv', 'flights.csv', 'fmri.csv', 'gammas.csv', 'iris.csv', 'iris_copy.csv', 'mpg.csv', 'planets.csv', 'tips.csv', 'titanic.csv']


In [3]:
# initialize empty dataframe
    # might consider doing this if you're appending/concatenating data sets together 
    # but don't want to explicitly code the # of items if it may change 

df = pd.DataFrame()

In [4]:
type(df)

pandas.core.frame.DataFrame

In [5]:
ser = pd.Series(dtype='float64')  # set dtype explicitly; a bare pd.Series() emits a warning in some pandas versions


In [6]:
type(ser)

pandas.core.series.Series

In [7]:
# from data as Python list of lists
headers = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
data_lol = [[5.1, 3.5, 1.4, 0.2, 'setosa'],
         [4.9, 3.0, 1.4, 0.2, 'setosa'],
         [4.7, 3.2, 1.3, 0.2, 'setosa'],
         [4.6, 3.1, 1.5, 0.2, 'setosa'],
         [5.0, 3.6, 1.4, 0.2, 'setosa'],
         [5.4, 3.9, 1.7, 0.4, 'setosa'],
         [4.6, 3.4, 1.4, 0.3, 'setosa'],
         [5.0, 3.4, 1.5, 0.2, 'setosa'],
         [4.4, 2.9, 1.4, 0.2, 'setosa'],
         [4.9, 3.1, 1.5, 0.1, 'setosa']]


df = pd.DataFrame(data_lol, columns=headers)
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


In [8]:
# from Python dictionary
# 5 rows of data (items 0-4) with 5 columns

data_dict = {0: {'petal_length': 1.4,
             'petal_width': 0.2,
             'sepal_length': 5.1,
             'sepal_width': 3.5,
             'species': 'setosa'},
         1: {'petal_length': 1.4,
             'petal_width': 0.2,
             'sepal_length': 4.9,
             'sepal_width': 3.0,
             'species': 'setosa'},
         2: {'petal_length': 1.3,
             'petal_width': 0.2,
             'sepal_length': 4.7,
             'sepal_width': 3.2,
             'species': 'setosa'},
         3: {'petal_length': 1.5,
             'petal_width': 0.2,
             'sepal_length': 4.6,
             'sepal_width': 3.1,
             'species': 'setosa'},
         4: {'petal_length': 1.4,
             'petal_width': 0.2,
             'sepal_length': 5.0,
             'sepal_width': 3.6,
             'species': 'setosa'}}

df = pd.DataFrame(data_dict).T
df

Unnamed: 0,petal_length,petal_width,sepal_length,sepal_width,species
0,1.4,0.2,5.1,3.5,setosa
1,1.4,0.2,4.9,3.0,setosa
2,1.3,0.2,4.7,3.2,setosa
3,1.5,0.2,4.6,3.1,setosa
4,1.4,0.2,5.0,3.6,setosa
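
One side effect of the `.T` approach worth knowing about: transposing a frame with mixed types can leave every column as `object`. A possible alternative, sketched below with two made-up rows in the same row-keyed layout, is `pd.DataFrame.from_dict(..., orient='index')`, which should infer a data type per column.

```python
import pandas as pd

# two illustrative rows keyed by row label, like data_dict above
rows = {0: {'petal_length': 1.4, 'species': 'setosa'},
        1: {'petal_length': 1.3, 'species': 'setosa'}}

# transpose approach: numeric column ends up as generic object dtype
transposed = pd.DataFrame(rows).T

# from_dict with orient='index' treats each outer key as a row
from_rows = pd.DataFrame.from_dict(rows, orient='index')

print(transposed.dtypes)
print(from_rows.dtypes)
```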


In [9]:
# initialize dataframe by reading from data source
    # pd.read_csv, pd.read_excel, pd.read_sql, etc

df = pd.read_csv('data/iris.csv')

In [10]:
# exporting
df.to_csv('./data/iris_copy.csv')

# requires the openpyxl package to be installed
# df.to_excel('./data/iris_copy.xlsx')

# list contents in data directory
os.listdir('data')

['anscombe.csv',
 'attention.csv',
 'brain_networks.csv',
 'car_crashes.csv',
 'diamonds.csv',
 'Divvy_Trips_2019_Q1.csv',
 'Divvy_Trips_2019_Q2.csv',
 'Divvy_Trips_2019_Q3.csv',
 'Divvy_Trips_2019_Q4.csv',
 'dots.csv',
 'exercise.csv',
 'flights.csv',
 'fmri.csv',
 'gammas.csv',
 'iris.csv',
 'iris_copy.csv',
 'mpg.csv',
 'planets.csv',
 'tips.csv',
 'titanic.csv']
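
By default `to_csv()` also writes the index as an unnamed first column, which then shows up as `Unnamed: 0` when the file is read back. The sketch below uses a small made-up frame and an in-memory buffer instead of a file to show the effect of `index=False`.

```python
import io
import pandas as pd

# small illustrative frame
df_small = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# to_csv() with no path returns the CSV text as a string
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

# read the text back in through an in-memory buffer
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with)
print(cols_without)
```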

In [11]:
# when unsure about the available options, use the built-in help() docs or check the library docs
# ex: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

# read_csv has a lot of optional parameters
    # for setting data types (ex: for efficiency and formatting purposes)
    # specifying the delimiter
    # default behavior with null values

help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file in
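
A minimal sketch of three of those optional parameters (`sep`, `dtype`, `na_values`) follows. The semicolon-delimited text and the `'missing'` marker are invented for the example; a real file would be passed as a path instead of a buffer.

```python
import io
import pandas as pd

# made-up semicolon-delimited data with two kinds of missing markers
raw = "id;score;label\n1;3.5;a\n2;NA;b\n3;missing;c\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep=';',                  # delimiter other than the default comma
    dtype={'id': 'int32'},    # force a smaller integer type
    na_values=['missing'],    # treat this extra string as null
)

print(sample.dtypes)
print(sample)
```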

In [12]:
# use .head() or .tail() to inspect first or last 5 records
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [13]:
df.tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [14]:
# you can override the default to show more or fewer records

df.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [15]:
df.index

RangeIndex(start=0, stop=150, step=1)

Note  `df.reset_index()` vs. `df = df.reset_index()`.

Pandas operations do not overwrite your data by default; most methods return a new object rather than operating "in-place".

For a change to persist, you must re-assign the variable (IMO: more idiomatic) or use the `inplace=True` parameter.

The following are equivalent. IMO: the first method is more "Pythonic"/consistent with generic Python re-assignment.

```python
df = df.reset_index()
df.reset_index(inplace=True)
```
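
A small sketch of the difference, using a made-up two-row frame: calling the method without re-assigning leaves the original untouched.

```python
import pandas as pd

# illustrative frame with a non-default index
demo = pd.DataFrame({'a': [10, 20]}, index=['x', 'y'])

# result is returned but thrown away; demo itself is unchanged
demo.reset_index()
columns_before = demo.columns.tolist()

# re-assigning keeps the result
demo = demo.reset_index()
columns_after = demo.columns.tolist()

print(columns_before, columns_after)
```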

In [16]:
# index is the logical key (/granularity of the dataset) 
# that can simplify some joins and plotting purposes
# if it's a column you plan to do any analysis on, I'd suggest pulling it out by default

df = df.reset_index()

In [17]:
df

Unnamed: 0,index,sepal_length,sepal_width,petal_length,petal_width,species
0,0,5.1,3.5,1.4,0.2,setosa
1,1,4.9,3.0,1.4,0.2,setosa
2,2,4.7,3.2,1.3,0.2,setosa
3,3,4.6,3.1,1.5,0.2,setosa
4,4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...,...
145,145,6.7,3.0,5.2,2.3,virginica
146,146,6.3,2.5,5.0,1.9,virginica
147,147,6.5,3.0,5.2,2.0,virginica
148,148,6.2,3.4,5.4,2.3,virginica


In [18]:
# other handy dataframe methods

# df.to_clipboard()
# df.to_clipboard(index=False)
# pd.read_clipboard()

In [19]:
# get column names

# df.columns
df.columns.tolist()

['index',
 'sepal_length',
 'sepal_width',
 'petal_length',
 'petal_width',
 'species']

In [20]:
# if you need to convert a dataframe back to a numpy array (or to plain Python objects with .tolist()) for some reason...

df.values

array([[0, 5.1, 3.5, 1.4, 0.2, 'setosa'],
       [1, 4.9, 3.0, 1.4, 0.2, 'setosa'],
       [2, 4.7, 3.2, 1.3, 0.2, 'setosa'],
       [3, 4.6, 3.1, 1.5, 0.2, 'setosa'],
       [4, 5.0, 3.6, 1.4, 0.2, 'setosa'],
       [5, 5.4, 3.9, 1.7, 0.4, 'setosa'],
       [6, 4.6, 3.4, 1.4, 0.3, 'setosa'],
       [7, 5.0, 3.4, 1.5, 0.2, 'setosa'],
       [8, 4.4, 2.9, 1.4, 0.2, 'setosa'],
       [9, 4.9, 3.1, 1.5, 0.1, 'setosa'],
       [10, 5.4, 3.7, 1.5, 0.2, 'setosa'],
       [11, 4.8, 3.4, 1.6, 0.2, 'setosa'],
       [12, 4.8, 3.0, 1.4, 0.1, 'setosa'],
       [13, 4.3, 3.0, 1.1, 0.1, 'setosa'],
       [14, 5.8, 4.0, 1.2, 0.2, 'setosa'],
       [15, 5.7, 4.4, 1.5, 0.4, 'setosa'],
       [16, 5.4, 3.9, 1.3, 0.4, 'setosa'],
       [17, 5.1, 3.5, 1.4, 0.3, 'setosa'],
       [18, 5.7, 3.8, 1.7, 0.3, 'setosa'],
       [19, 5.1, 3.8, 1.5, 0.3, 'setosa'],
       [20, 5.4, 3.4, 1.7, 0.2, 'setosa'],
       [21, 5.1, 3.7, 1.5, 0.4, 'setosa'],
       [22, 4.6, 3.6, 1.0, 0.2, 'setosa'],
       [23, 5.1, 3.3,

In [21]:
df.values.tolist()

[[0, 5.1, 3.5, 1.4, 0.2, 'setosa'],
 [1, 4.9, 3.0, 1.4, 0.2, 'setosa'],
 [2, 4.7, 3.2, 1.3, 0.2, 'setosa'],
 [3, 4.6, 3.1, 1.5, 0.2, 'setosa'],
 [4, 5.0, 3.6, 1.4, 0.2, 'setosa'],
 [5, 5.4, 3.9, 1.7, 0.4, 'setosa'],
 [6, 4.6, 3.4, 1.4, 0.3, 'setosa'],
 [7, 5.0, 3.4, 1.5, 0.2, 'setosa'],
 [8, 4.4, 2.9, 1.4, 0.2, 'setosa'],
 [9, 4.9, 3.1, 1.5, 0.1, 'setosa'],
 [10, 5.4, 3.7, 1.5, 0.2, 'setosa'],
 [11, 4.8, 3.4, 1.6, 0.2, 'setosa'],
 [12, 4.8, 3.0, 1.4, 0.1, 'setosa'],
 [13, 4.3, 3.0, 1.1, 0.1, 'setosa'],
 [14, 5.8, 4.0, 1.2, 0.2, 'setosa'],
 [15, 5.7, 4.4, 1.5, 0.4, 'setosa'],
 [16, 5.4, 3.9, 1.3, 0.4, 'setosa'],
 [17, 5.1, 3.5, 1.4, 0.3, 'setosa'],
 [18, 5.7, 3.8, 1.7, 0.3, 'setosa'],
 [19, 5.1, 3.8, 1.5, 0.3, 'setosa'],
 [20, 5.4, 3.4, 1.7, 0.2, 'setosa'],
 [21, 5.1, 3.7, 1.5, 0.4, 'setosa'],
 [22, 4.6, 3.6, 1.0, 0.2, 'setosa'],
 [23, 5.1, 3.3, 1.7, 0.5, 'setosa'],
 [24, 4.8, 3.4, 1.9, 0.2, 'setosa'],
 [25, 5.0, 3.0, 1.6, 0.2, 'setosa'],
 [26, 5.0, 3.4, 1.6, 0.4, 'setosa'],
 [27, 5.2, 

In [22]:
# chaining commands

df.head().to_dict()
# df.to_dict(orient='index')
# df.to_dict(orient='records')


{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
 'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0},
 'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6},
 'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4},
 'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2},
 'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa'}}
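
To make the `orient` options concrete, here is a short sketch on a made-up two-row frame. The default output above is keyed by column; `'records'` and `'index'` are keyed by row instead.

```python
import pandas as pd

# illustrative frame
small = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# list of one dict per row
records = small.to_dict(orient='records')

# dict keyed by index label, each value a dict for that row
by_index = small.to_dict(orient='index')

print(records)
print(by_index)
```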

### Inspecting a dataset without pandas to determine delimiter or size

In [23]:
# previewing the delimiter without manually opening the file (ex: may be too big for text editor, etc)

# we can see it's comma-delimited
with open('./data/planets.csv') as f:
    print(f.readline())
    print(f.readline())

method,number,orbital_period,mass,distance,year

Radial Velocity,1,269.3,7.1,77.4,2006



In [24]:
# getting the size by iterating over the file lazily, one line at a time (safe for limited RAM)
# next() raises a StopIteration error once all lines are "consumed"
n_rows = 0
with open('./data/planets.csv') as f:
    while True:
        try:
            next(f)
            n_rows += 1
        except StopIteration:
            break

# includes header
print(f'There are {n_rows} rows in planets.csv')

There are 1036 rows in planets.csv
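
The same count can be written more compactly with a generator expression, which still reads one line at a time. The sketch below uses an in-memory buffer with made-up contents so it runs without the data files; with a real file, `open(path)` would take the place of `io.StringIO(...)`.

```python
import io

# stand-in for a file on disk: a header plus two records
sample_text = "method,number\nRadial Velocity,1\nImaging,2\n"

with io.StringIO(sample_text) as f:
    # iterating over a file-like object yields one line at a time
    line_count = sum(1 for _ in f)

# includes header
print(f'There are {line_count} rows')
```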


<font color='blue'> 
### Exercises
* Create this dataframe from a Python list of lists or a dictionary:
```
|    |   a |   b |
|---:|----:|----:|
|  0 |   1 |   4 |
|  1 |   2 |   5 |
|  2 |   3 |   6 |
```

* load the ./data/planets.csv data set using `pd.read_csv()` into a dataframe named `planets_df`
* display the first 5 records
* display the first 10 records
* display the last 5 records
* print each file name in the `data/` directory using `os.listdir()` and a for loop
* bonus: display the filename and column names for each file in `data/`
    * hint: loop through each dataframe in ./data/* directory using `os.listdir`
    * read the dataset into a dataframe
    * use df.columns

</font>

In [25]:
# %load ./solutions/01-01.py

# Selecting Data

In [26]:
iris = pd.read_csv('./data/iris.csv')

### Selecting with "label-based" method (column name as string)

In [27]:
# notice how passing one string returns a Series. Put it in a list to get a dataframe back instead.
print(type(iris['sepal_length']))

iris['sepal_length']

<class 'pandas.core.series.Series'>


0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

In [28]:
iris[['sepal_length']].head()

Unnamed: 0,sepal_length
0,5.1
1,4.9
2,4.7
3,4.6
4,5.0


In [29]:
iris[['sepal_length', 'species']].head()

Unnamed: 0,sepal_length,species
0,5.1,setosa
1,4.9,setosa
2,4.7,setosa
3,4.6,setosa
4,5.0,setosa


### Selecting (indexing) data with .loc vs .iloc
* .loc is label-based
* .iloc is based on integer position
* the distinction matters when the index labels or column names are numbers (good practice to avoid numeric column names)


In [30]:
# df.loc[[<rows>], [<columns>]]

# ':' short hand for all rows
iris.loc[:, ['species']]

Unnamed: 0,species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa
...,...
145,virginica
146,virginica
147,virginica
148,virginica


In [31]:
# rows labeled 2 through 4
# notice how label 4 is included: .loc slices include the end label, unlike Python list slices (which exclude the end)
iris.loc[2:4, ['species']]

Unnamed: 0,species
2,setosa
3,setosa
4,setosa


In [32]:
# Python list slice demo
x_list = [0, 1, 2, 3, 4]
x_list[2:4]

[2, 3]
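
With the default 0..n index, labels and positions happen to coincide, which hides the difference between `.loc` and `.iloc`. The sketch below uses a made-up frame whose index starts at 10 to separate the two.

```python
import pandas as pd

# illustrative frame whose index labels differ from the row positions
letters = pd.DataFrame({'species': list('abcde')}, index=[10, 11, 12, 13, 14])

# .loc slices by label and includes the end label
by_label = letters.loc[12:14, 'species'].tolist()

# .iloc slices by position and excludes the end, like a Python list
by_position = letters.iloc[2:4, 0].tolist()

print(by_label)
print(by_position)
```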

# "dot" notation

* You might see this in others' code. Avoid this pattern when possible. 
* It breaks for column names that are not valid Python identifiers or that collide with existing DataFrame attributes/methods.

```python
# invalid
df.999
df.column with space
```
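
A runnable sketch of the collision case, using a made-up frame with a column named `count` (which is also a DataFrame method) and a column name containing spaces:

```python
import pandas as pd

# illustrative frame with awkward column names
awkward = pd.DataFrame({'count': [1, 2], 'column with space': [3, 4]})

# dot notation finds the DataFrame.count method, not the column
dot_is_method = callable(awkward.count)

# bracket notation always refers to the column
count_values = awkward['count'].tolist()
spaced_values = awkward['column with space'].tolist()

print(dot_is_method, count_values, spaced_values)
```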

In [33]:
iris.species

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object

In [34]:
iris.sepal_length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

# Boolean Indexing/Filtering
* use boolean indexes to filter
* use the `&` (and) operator
* use the `|` (or) pipe operator
* use `Series.isin(<iterable>)` and the `~` (not) operator

In [35]:
# reset
iris = sns.load_dataset('iris')
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [36]:
iris['species'] == 'virginica'

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: species, Length: 150, dtype: bool

In [37]:
(iris['species'] == 'virginica').sum()

50

In [38]:
# give me all records where species = virginica 

virginica = iris[iris['species'] == 'virginica']
virginica.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica
103,6.3,2.9,5.6,1.8,virginica
104,6.5,3.0,5.8,2.2,virginica


### Multiple conditions

This

```python
virginica_bool = df['species'] == 'virginica'
pw_greater_than_2_bool = df['petal_width'] > 2
filtered = df[virginica_bool & pw_greater_than_2_bool]
```
is the same as

```python
filtered = df[(df['species'] == 'virginica') & (df['petal_width'] > 2)]
```

The former is more readable when conditions get complicated.
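
One detail behind the one-line form: each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>` in Python. A small sketch on made-up data checks that the named-mask and one-line forms select the same rows.

```python
import pandas as pd

# illustrative data
flowers = pd.DataFrame({'species': ['virginica', 'virginica', 'setosa'],
                        'petal_width': [2.5, 1.8, 0.2]})

# named masks, combined afterwards
is_virginica = flowers['species'] == 'virginica'
is_wide = flowers['petal_width'] > 2
named = flowers[is_virginica & is_wide]

# one-liner: the parentheses around each comparison are required
inline = flowers[(flowers['species'] == 'virginica') & (flowers['petal_width'] > 2)]

print(named)
```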

In [39]:
virginica_bool = iris['species'] == 'virginica'
pw_greater_than_2_bool = iris['petal_width'] > 2

virg_and_gt2 = virginica_bool & pw_greater_than_2_bool 
filtered_AND = iris[virginica_bool & pw_greater_than_2_bool]
filtered_OR = iris[virginica_bool | pw_greater_than_2_bool]
# filtered_AND = df[virginica_bool & pw_greater_than_2_bool]

In [40]:
# df.shape returns a tuple: (n_rows, n_columns)

filtered_AND.shape

(23, 5)

In [41]:
filtered_OR.shape

(50, 5)

In [42]:
filtered_OR.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
100,6.3,3.3,6.0,2.5,virginica
101,5.8,2.7,5.1,1.9,virginica
102,7.1,3.0,5.9,2.1,virginica
103,6.3,2.9,5.6,1.8,virginica
104,6.5,3.0,5.8,2.2,virginica


In [43]:
# ~ tilde --> not operator
# wrap with parentheses

# where species is not `setosa`
iris[~(iris['species']=='setosa')]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [44]:
# where species is in [list]
iris[iris['species'].isin(['virginica', 'versicolor'])]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


### Selecting by data type

In [45]:
# selecting all floats
iris.select_dtypes(float).head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [46]:
# selecting all objects/strings

iris.select_dtypes('object').head()

Unnamed: 0,species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa


<font color='blue'> 
### Exercises
* load the planets.csv data set into a dataframe
* select the `method` column and store in a dataframe called `method_df`
* select both the `mass` and `distance` columns
* use a boolean index to select records where the value of 'method' is 'Imaging'
* use a boolean index with multiple conditions to select records where the value of 'method' is 'Imaging' and the year is 2011
* select all columns of integer data type using `df.select_dtypes('int')`
</font>

### Math operations
* df.mean()
* df.sum()
* df.std()
* column / vector math

In [47]:
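# on whole dataframe
# numeric_only=True skips the string `species` column (recent pandas versions raise an error without it)
iris.mean(numeric_only=True)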

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

In [48]:
# on select column
iris['sepal_length'].mean()

5.843333333333334

In [49]:
iris[['sepal_length', 'sepal_width']].mean()

sepal_length    5.843333
sepal_width     3.057333
dtype: float64

In [50]:
iris.sum()

sepal_length                                                876.5
sepal_width                                                 458.6
petal_length                                                563.7
petal_width                                                 179.9
species         setosasetosasetosasetosasetosasetosasetosaseto...
dtype: object

In [51]:
iris.std(numeric_only=True)

sepal_length    0.828066
sepal_width     0.435866
petal_length    1.765298
petal_width     0.762238
dtype: float64

In [52]:
# can do math on columns
# this is known as a "vectorized" operation since we operate on the whole column at once
    # as opposed to an explicit row-by-row loop 
    # vectorized operations push the loop down to the much faster C level (the standard Python interpreter and numpy internals are written in C)

iris['sepal_length'] * iris['sepal_width']

0      17.85
1      14.70
2      15.04
3      14.26
4      18.00
       ...  
145    20.10
146    15.75
147    19.50
148    21.08
149    17.70
Length: 150, dtype: float64

### Operations across columns

Ex: `df.sum(axis='columns')`

In [53]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [54]:
# get row 3
# columns: petal_length, petal_width

one_row = iris.loc[[3], ['petal_length', 'petal_width']]
one_row

Unnamed: 0,petal_length,petal_width
3,1.5,0.2


In [55]:
# use axis='columns' or axis=1
    # axis=1 applies across columns (think shape of 1 looks tall like column)
one_row.sum(axis='columns')

3    1.7
dtype: float64

In [56]:
# more idiomatic way to sum across columns
one_row['petal_length'] + one_row['petal_width']

3    1.7
dtype: float64

In [57]:
# default behavior (axis=0) aggregates down the rows, giving one result per column
# one_row.sum()

one_row.sum(axis='rows')

petal_length    1.5
petal_width     0.2
dtype: float64
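
With a single row the two axes are hard to tell apart, so here is the same comparison on a made-up 2x2 frame. The sketch uses `axis='index'`, the other string spelling of `axis=0`.

```python
import pandas as pd

# illustrative 2x2 frame
grid = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0 / 'index': collapse the rows, one total per column
per_column = grid.sum(axis='index').to_dict()

# axis=1 / 'columns': collapse the columns, one total per row
per_row = grid.sum(axis='columns').tolist()

print(per_column)
print(per_row)
```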

<font color='blue'> 
### Exercises
* load the flights.csv data set into a dataframe
* identify the total number of passengers for the year 1960
    * hint: boolean index for year condition, use .loc to get passenger column, then .sum()
* identify the average/mean monthly number of passengers for the year 1955

</font>

In [58]:
# %load ./solutions/01-02.py

### Sorting

`df.sort_values(by=column_list, ascending=list_of_bools)`

In [59]:
# primary sort arguments
    # pass list of columns to sort by
    # pass True/False bool for order
# does not sort in place! must re-assign
    # ex: `df = df.sort_values(...)`
    
sorted_df = iris.sort_values(by=['species', 'petal_length'], ascending=[False, True])
sorted_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
106,4.9,2.5,4.5,1.7,virginica
126,6.2,2.8,4.8,1.8,virginica
138,6.0,3.0,4.8,1.8,virginica
121,5.6,2.8,4.9,2.0,virginica
123,6.3,2.7,4.9,1.8,virginica


<font color='blue'> 
### Exercises
* load the mpg.csv data set
* return a dataframe of the model name sorted in order of descending weight
* return a dataframe sorted by `origin` and model name (`name`) in ascending order for both

</font>

In [60]:
# %load ./solutions/01-03.py

### Profiling data
* df.head()
* df.describe()
* df.value_counts()
* df.dtypes
* df.info()

In [61]:
# inspect with .head()/.tail()

iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [62]:
# df.describe() --> descriptive statistics
# defaults to numeric columns. Use include='all' to include counts for categorical values too

iris.describe()
# df.describe(include='all')

# transpose
# df.describe(include='all').T

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [63]:
iris['species'].value_counts()

versicolor    50
setosa        50
virginica     50
Name: species, dtype: int64

In [64]:
iris['species'].value_counts(normalize=True)

versicolor    0.333333
setosa        0.333333
virginica     0.333333
Name: species, dtype: float64

In [65]:
iris['species'].value_counts(dropna=False)

versicolor    50
setosa        50
virginica     50
Name: species, dtype: int64
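
The iris data has no nulls, so `dropna=False` makes no visible difference above. The sketch below uses a made-up Series with one missing value to show what the argument changes.

```python
import pandas as pd

# illustrative values with one null
cuts = pd.Series(['Ideal', 'Good', None, 'Ideal'])

# default: nulls are silently left out of the counts
default_counts = cuts.value_counts()

# dropna=False: nulls get their own row
with_nulls = cuts.value_counts(dropna=False)

print(default_counts)
print(with_nulls)
```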

In [66]:
# data types
    # 'object'/string data types are heavier on memory usage
    # SQL principles apply -- aggregate early before adding labels
    # in later session, we can show how to convert to pandas Categorical data type for efficiency

iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

In [67]:
iris.info()
# iris.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


<font color='blue'> 
### Exercises
* load the diamonds.csv data set
* identify the distribution for the 'cut' variable with counts by value
* identify the percentage distribution
* add the `dropna=False` argument to make sure there are no null values

</font>

In [68]:
# %load ./solutions/01-04.py

# Modifying data

### New columns with `df['new_col'] = <some operation>`

In [69]:
iris = sns.load_dataset('iris').head()
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [70]:
iris['species_upper'] = iris['species'].str.upper()
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_upper
0,5.1,3.5,1.4,0.2,setosa,SETOSA
1,4.9,3.0,1.4,0.2,setosa,SETOSA
2,4.7,3.2,1.3,0.2,setosa,SETOSA
3,4.6,3.1,1.5,0.2,setosa,SETOSA
4,5.0,3.6,1.4,0.2,setosa,SETOSA


In [71]:
iris['sepal_area'] = iris['sepal_length'] * iris['sepal_width']
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,species_upper,sepal_area
0,5.1,3.5,1.4,0.2,setosa,SETOSA,17.85
1,4.9,3.0,1.4,0.2,setosa,SETOSA,14.7
2,4.7,3.2,1.3,0.2,setosa,SETOSA,15.04
3,4.6,3.1,1.5,0.2,setosa,SETOSA,14.26
4,5.0,3.6,1.4,0.2,setosa,SETOSA,18.0


### Renaming columns with `df.columns = column_list` or `df = df.rename(columns=<dict>)`

In [72]:
columns = iris.columns.tolist()
columns

['sepal_length',
 'sepal_width',
 'petal_length',
 'petal_width',
 'species',
 'species_upper',
 'sepal_area']

In [73]:
# rename group of columns
iris.columns = ['SEPAL_LENGTH', 'SEPAL_WIDTH', 'PETAL_LENGTH', 'PETAL_WIDTH', 'SPECIES', 'SPECIES_UPPER', 'SEPAL_AREA']
iris

Unnamed: 0,SEPAL_LENGTH,SEPAL_WIDTH,PETAL_LENGTH,PETAL_WIDTH,SPECIES,SPECIES_UPPER,SEPAL_AREA
0,5.1,3.5,1.4,0.2,setosa,SETOSA,17.85
1,4.9,3.0,1.4,0.2,setosa,SETOSA,14.7
2,4.7,3.2,1.3,0.2,setosa,SETOSA,15.04
3,4.6,3.1,1.5,0.2,setosa,SETOSA,14.26
4,5.0,3.6,1.4,0.2,setosa,SETOSA,18.0


In [74]:
# not modified in-place

iris.rename(columns={'SEPAL_LENGTH': 'sepal_length'})


Unnamed: 0,sepal_length,SEPAL_WIDTH,PETAL_LENGTH,PETAL_WIDTH,SPECIES,SPECIES_UPPER,SEPAL_AREA
0,5.1,3.5,1.4,0.2,setosa,SETOSA,17.85
1,4.9,3.0,1.4,0.2,setosa,SETOSA,14.7
2,4.7,3.2,1.3,0.2,setosa,SETOSA,15.04
3,4.6,3.1,1.5,0.2,setosa,SETOSA,14.26
4,5.0,3.6,1.4,0.2,setosa,SETOSA,18.0


### Modifying data types with `df['column'].astype(< type >)`

In [75]:
iris = sns.load_dataset('iris').head()

iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

In [76]:
iris['sepal_length'].astype(int)

0    5
1    4
2    4
3    4
4    5
Name: sepal_length, dtype: int32
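
As the output shows, `astype(int)` drops the decimal part rather than rounding (4.9 becomes 4). If rounding is what you want, one option is to call `.round()` first, sketched here on made-up values.

```python
import pandas as pd

# illustrative values
lengths = pd.Series([4.9, 5.1, 4.6])

# astype(int) truncates toward zero
truncated = lengths.astype(int).tolist()

# round first, then convert
rounded = lengths.round().astype(int).tolist()

print(truncated, rounded)
```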

In [77]:
iris[['sepal_length', 'sepal_width']].astype(int)

Unnamed: 0,sepal_length,sepal_width
0,5,3
1,4,3
2,4,3
3,4,3
4,5,3


In [78]:
# earlier changes did not modify in-place
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [79]:
# need to re-assign for changes to take effect
iris[['sepal_length', 'sepal_width']] = iris[['sepal_length', 'sepal_width']].astype(int)

In [80]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5,3,1.4,0.2,setosa
1,4,3,1.4,0.2,setosa
2,4,3,1.3,0.2,setosa
3,4,3,1.5,0.2,setosa
4,5,3,1.4,0.2,setosa


In [81]:
# perhaps you need to convert a year to a string for formatting operations

year_df = pd.DataFrame([2020, 2019, 2018, 2017, 2016], columns=['year'])
year_df['year'] = year_df['year'].astype(str)

print(year_df.dtypes)
year_df

year    object
dtype: object


Unnamed: 0,year
0,2020
1,2019
2,2018
3,2017
4,2016


<font color='blue'> 
### Exercises
* load the mpg.csv data set
* create a column called 'pw_ratio' and calculate the power to 10-pound weight ratio
    * power-to-10lb-weight ratio = (horsepower / weight)*10
* sort the data set from highest to lowest power-weight ratio
* create a dataframe for each country of origin (filter using boolean indexing)
     * use value_counts() on 'origin' field to determine the unique values
* for each origin country, determine the model with the highest power-weight ratio, along with that ratio
   * this is a manual application of a group-by
   * check your answer against the result of `mpg_df.groupby(['origin'])[['pw_ratio']].max().T`
   * or check against:
    ```python
    max_ratio_list = mpg.groupby(['origin'])[['pw_ratio']].max()['pw_ratio']
    mpg.loc[mpg['pw_ratio'].isin(max_ratio_list), ['origin', 'name', 'pw_ratio']]
    ```

</font>

In [82]:
# %load ./solutions/01-05.py

In [83]:
# max_ratio_list = mpg.groupby(['origin'])[['pw_ratio']].max()['pw_ratio']
# mpg.loc[mpg['pw_ratio'].isin(max_ratio_list), ['origin', 'name', 'pw_ratio']]

# End