# Introduction to `pandas`

We import the `pandas` package as `pd`, the conventional abbreviation. (See the book _Python for Data Analysis_ by Wes McKinney.)

In [None]:
import pandas as pd

The `pandas` package has two main data types. One data type is `Series`.

In [None]:
series = pd.Series(['BMW', 'Toyota', 'Honda'])
series

Series are **one-dimensional**.

Create a second `Series` for later use.

In [None]:
colors = pd.Series(['Red', 'Blue', 'White'])
colors

The other main data type is `DataFrame`. Data frames are **two-dimensional**.

In [None]:
car_data = pd.DataFrame({'Car make': series, 'Color': colors})
car_data
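
As a quick check of the one-dimensional versus two-dimensional claim, the sketch below rebuilds the same two `Series` and inspects the `ndim` and `shape` attributes. The names `makes` and `cars` are only illustrative.

```python
import pandas as pd

# Rebuild the two Series used above (same values, illustrative names).
makes = pd.Series(['BMW', 'Toyota', 'Honda'])
colors = pd.Series(['Red', 'Blue', 'White'])

# Combine them into a data frame, one column per Series.
cars = pd.DataFrame({'Car make': makes, 'Color': colors})

print(makes.ndim, makes.shape)  # 1 (3,)
print(cars.ndim, cars.shape)    # 2 (3, 2)
```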

One can import data from a `.csv` file.

In [None]:
car_sales = pd.read_csv('./car-sales.csv')
car_sales

## "Anatomy" of a dataframe

- The left-most column, when printed, is the index (0-9 above).
- Rows (axis=0)
- Columns (axis=1)
- See [Anatomy of a Dataframe](./Anatomy%20of%20a%20Dataframe.png)
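
To make the `axis` numbering concrete, here is a minimal sketch on a small made-up frame (the `toy` data and the `Seats` column are assumptions, not part of the car sales file). Passing `axis=0` to `sum()` works down the rows and yields one value per column; `axis=1` works across the columns and yields one value per row.

```python
import pandas as pd

# Hypothetical numeric data, used only to illustrate the axis argument.
toy = pd.DataFrame({'Doors': [4, 4, 3], 'Seats': [5, 5, 4]})

col_totals = toy.sum(axis=0)  # collapses the rows: one total per column
row_totals = toy.sum(axis=1)  # collapses the columns: one total per row

print(col_totals)
print(row_totals)
```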

Even though we've not changed the car sales data, we export it to
demonstrate the `pandas` export functionality.

In [None]:
car_sales.to_csv('exported_car_sales.csv')

As an alternative, we could export the file in Excel format using
`car_sales.to_excel('exported_car_sales.xlsx')`. However, this functionality
requires the module `openpyxl`.

Reading a previously exported `.csv` exhibits surprising behavior.

In [None]:
exported_car_sales = pd.read_csv('exported_car_sales.csv')
exported_car_sales

That is, reading a `pandas` data frame from a `.csv` file that was
previously created by exporting a `DataFrame` "adds" an
additional column named "Unnamed: 0".

This behavior is simply a consequence of the way we originally
**exported** the `DataFrame`.

The `DataFrame.to_csv()` method accepts a parameter, `index`.
`index` is a boolean parameter. If its value is truthy (the
default is `True`), `pandas` will include the `Index` in the export.
However, if `index` is falsy, the export will **not**
contain the data frame `Index`.

In [None]:
# Passing a falsy value for the optional parameter `index`
# instructs `to_csv()` **not** to export the `Index` of the
# data frame. (`Series.to_csv()` accepts the same parameter.)

# Consequently, reading the exported data back **will not** produce
# the "mystery" column (named something like "Unnamed: 0").
car_sales.to_csv('exported_car_sales.csv', index=False)
exported_car_sales = pd.read_csv('exported_car_sales.csv')
exported_car_sales
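
The round trip can also be reproduced without touching the disk. The sketch below uses a small made-up frame (`toy` is an assumption, not the car sales data) and relies on `to_csv()` returning a string when no path is given. It also shows a second way around the extra column: keeping the index in the file and reading it back with `index_col=0`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the car sales data.
toy = pd.DataFrame({'Make': ['BMW', 'Toyota'], 'Doors': [5, 4]})

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = toy.to_csv()                # index=True is the default
without_index = toy.to_csv(index=False)

# Reading the text back shows where the extra column comes from.
cols_with = list(pd.read_csv(io.StringIO(with_index)).columns)
cols_without = list(pd.read_csv(io.StringIO(without_index)).columns)
print(cols_with)     # ['Unnamed: 0', 'Make', 'Doors']
print(cols_without)  # ['Make', 'Doors']

# Alternative: keep the index in the file and tell read_csv where it is.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(restored.columns))  # ['Make', 'Doors']
```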

## Describe data

In [None]:
# An attribute (basically reads data)
car_sales.dtypes

# A function executes code.
# car_sales.to_csv()

In [None]:
# Another attribute: the columns of the data frame
car_sales.columns

A `pandas` `Index` is basically a list of values of a specified type.

In [None]:
car_columns = car_sales.columns
car_columns

In [None]:
car_columns[0], car_columns[2:-1]
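
One way an `Index` differs from a plain list is that it is immutable. The sketch below builds an `Index` directly from made-up column labels (an assumption; the real labels come from the `.csv` file), repeats the positional access and slicing shown above, and then shows that assigning to an element raises a `TypeError`.

```python
import pandas as pd

# Hypothetical column labels, used only for illustration.
cols = pd.Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'])

first = cols[0]       # a single label
middle = cols[2:-1]   # slicing returns another Index

print(first)
print(list(middle))   # ['Odometer (KM)', 'Doors']

# Unlike a list, an Index cannot be modified in place.
try:
    cols[0] = 'Manufacturer'
    mutated = True
except TypeError:
    mutated = False
print(mutated)        # False
```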

In [None]:
# What about an `index` property?
car_sales.index

In [None]:
car_sales

### DataFrame functions

In [None]:
# Summary statistics about **numeric** fields
car_sales.describe()

Note that the `price` column is of `dtype` `object` (not a number).
Because the `price` column is a **non-numeric** column (neither
`int` nor `float`), it is **not** included in the summary statistics
calculated by `describe()`.
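
The effect is easy to reproduce on a small made-up frame. In this sketch the `toy` data and its price strings are assumptions; the point is only that a column of text is left out of the default `describe()` output.

```python
import pandas as pd

# Hypothetical data: one numeric column and one column of price strings.
toy = pd.DataFrame({
    'Doors': [4, 4, 3],
    'Price': ['$4,000.00', '$5,000.00', '$7,000.00'],
})

summary = toy.describe()

print(toy.dtypes)             # `Price` is not a numeric dtype
print(list(summary.columns))  # ['Doors']
```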

In [None]:
car_sales.info()

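The function `DataFrame.info()` provides a concise summary of each
column in the `DataFrame`: its name, its count of non-null values, and
its `dtype`. It also reports the index range and the memory usage.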

In [None]:
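try:
    car_sales.mean()  # Average of every column; fails on non-numeric ones
except (TypeError, ValueError):
    # Since pandas 2.0, `numeric_only` defaults to False, so non-numeric
    # columns are no longer silently dropped.
    print('Hmmm. Apparently a change in behavior from the videos.')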

In [None]:
car_sales[['Odometer (KM)', 'Doors']].mean()
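
Selecting the numeric columns by name is one option. As a sketch of two alternatives, `mean()` accepts a `numeric_only` parameter, and `select_dtypes()` can pick out the numeric columns without naming them. The `toy` frame below is made up; only the column names `Odometer (KM)` and `Doors` come from the data above.

```python
import pandas as pd

# Hypothetical values; the column names mirror the car sales data.
toy = pd.DataFrame({
    'Make': ['BMW', 'Toyota', 'Honda'],
    'Odometer (KM)': [10000, 20000, 30000],
    'Doors': [5, 4, 4],
})

# Option 1: ask mean() to skip non-numeric columns.
means = toy.mean(numeric_only=True)

# Option 2: select the numeric columns first, then average them.
means_selected = toy.select_dtypes(include='number').mean()

print(means)
```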

The function `mean()` also works on appropriate instances
of `Series`.

In [None]:
# Pretending our car prices are of dtype `int` and not of dtype `object`
car_prices = pd.Series([3000, 1500, 111250])
car_prices.mean()

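We can try calling the `sum()` function on the entire `car_sales`
`DataFrame`. We may have the same "luck" as we did with `mean()`.
If the non-numeric columns hold strings, `sum()` may instead
concatenate them, which is rarely what we want.

In [None]:
try:
    car_sales.sum()  # Sum of every column, numeric or not
except (TypeError, ValueError):
    print('Same issues as `car_sales.mean()`')

Either way, we use the same "solution": restrict the call to the numeric columns.

In [None]:
car_sales[['Odometer (KM)', 'Doors']].sum()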

The `pandas` package provides "typical" statistical functions
for numeric columns.

In [None]:
# Extract the column of interest (simply so we need not type it
# all the time).
mileages = car_sales['Odometer (KM)']
mileages.mean(), mileages.median(), mileages.mode()
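
Note that `mean()` and `median()` each return a single number, while `mode()` returns a `Series`, because several values can tie for most frequent. A minimal sketch with made-up mileages (the values are assumptions chosen to produce a tie):

```python
import pandas as pd

# Hypothetical odometer readings with two equally common values.
toy_mileages = pd.Series([10000, 20000, 20000, 30000, 30000])

avg = toy_mileages.mean()
mid = toy_mileages.median()
modes = toy_mileages.mode()  # a Series, sorted, one entry per tied value

print(avg)             # 22000.0
print(mid)             # 20000.0
print(modes.tolist())  # [20000, 30000]
```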

The `len` function returns the number of rows in our data frame.

In [None]:
len(car_sales)
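
For comparison, the `shape` attribute reports rows and columns together, and `size` reports the total number of cells. A small sketch on a made-up frame (`toy` is an assumption):

```python
import pandas as pd

# Hypothetical frame with three rows and two columns.
toy = pd.DataFrame({'Make': ['BMW', 'Toyota', 'Honda'], 'Doors': [5, 4, 4]})

n_rows = len(toy)             # number of rows only
n_rows_cols = toy.shape       # (rows, columns)
n_cells = toy.size            # rows * columns

print(n_rows, n_rows_cols, n_cells)  # 3 (3, 2) 6
```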