# Introduction to Pandas
<img src="https://raw.githubusercontent.com/earthinversion/earthinversion-images/main/images/pandas-python.png" width="200" style="float: center"/>

- Pandas is designed for manipulating mixtures of data types in <span style="color:blue">***tabular formats*** </span> (much like Excel spreadsheets).
- Pandas contains data structures and data manipulation tools designed to make <span style="color:blue">***data cleaning and analysis*** </span> fast and easy in Python.

**Difference between Pandas and NumPy**

- Pandas is designed for working with ***tabular or heterogeneous data***.
- NumPy, by contrast, is best suited for working with ***homogeneous numerical array data***.
- Pandas is commonly used for ***data analysis and visualization***.
- NumPy is commonly used for ***numerical calculations***.
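
The contrast above can be sketched with a small example. The column names and values are made up for illustration; the point is only that a NumPy array carries one dtype for all elements, while a DataFrame carries one dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array stores a single dtype for every element.
arr = np.array([1.5, 2.0, 3.5])
print(arr.dtype)

# A DataFrame stores one dtype per column, so columns can differ.
# These column names and values are hypothetical.
table = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": [1, 2, 3],
    "score": [1.5, 2.0, 3.5],
})
print(table.dtypes)
```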
---

**Why is Pandas named so?**

The name "pandas" is derived from "panel data," which is an econometrics term for data sets that include observations over multiple time periods for the same individuals or entities.

The library was originally developed by **Wes McKinney**, who started it in 2008 at AQR Capital Management, where he worked with panel data. The name "pandas" was chosen because the library was initially designed to handle panel data sets efficiently.

Over time, the pandas library has evolved to handle a wide range of data types and structures beyond panel data, making it one of the most popular tools for data analysis and manipulation in Python.

---

### Outline:
- Series vs. DataFrame

- Working with Series

- Working with DataFrames

- Importing Data from .csv

- Creating Filters

---

## Using Pandas
We generally import Pandas under the alias `pd`:

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Check Pandas Version
pd.__version__

## Pandas Data Structures: *Series* and *DataFrame*

## Series

A Pandas "Series" is a one-dimensional, array-like object in which every value is paired with an index label.

In [None]:
mySeries1 = pd.Series([8, 3 , -6, 7])
mySeries1

- The left column shows the "indices".  By default, these will run from 0 to (number of entries - 1).

- The right column shows the "values".


In [None]:
# We can extract just the values:
mySeries1.values

In [None]:
# We can also look at the indices:
mySeries1.index

- This is like `range(0, len(mySeries1))`

One useful pandas feature is that we can define custom indices:

In [None]:
mySeries2 = pd.Series([8, 3, -6, 7], index = ['c', 'a', 'b', 'xyz'])
mySeries2

Take a look at the 3rd row:

In [None]:
# We can use the index name:
mySeries2['b']

In [None]:
# This returns the same value, but selects by integer position.
# Use .iloc for positions: plain mySeries2[2] is deprecated for a label-based index.
mySeries2.iloc[2]
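
Being explicit about label versus position matters most when the index labels are themselves integers. The following is a minimal sketch with made-up values: `.loc` always looks up by label, and `.iloc` always looks up by position.

```python
import pandas as pd

# Integer labels that deliberately do not match the row positions.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[2]       # the value whose index label is 2
by_position = s.iloc[2]   # the value in the third row (position 2)

print(by_label, by_position)
```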

In [None]:
# We can create a Series from a python dictionary:
myDict = {'HW1': 90, 'Exam 1': 77, 'Project': 88, 'HW2': 66}

mySeries3 = pd.Series(myDict)
mySeries3

- Note that the Series keeps the keys in the dictionary's insertion order; it is not sorted by key.

In [None]:
# We can provide an explicit ordering of indices:
assignments = ['HW1', 'HW2', 'HW3', 'Exam 1', 'Project']
mySeries4 = pd.Series(myDict, index = assignments)
mySeries4

- Note that index 'HW3' doesn't appear in myDict.  "NaN" stands for "Not a Number"; it represents a null/missing value.

`pd.isnull()`: This function indicates whether values are missing

In [None]:
pd.isnull(mySeries4)

`pd.notnull()`: This function detects non-missing values

In [None]:
pd.notnull(mySeries4)
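
Because both functions return a Series of True/False values, the result can be summed to count missing entries or used as a mask to keep only the non-missing ones. A minimal sketch, rebuilding the same scores under the hypothetical name `scores`:

```python
import pandas as pd

# Same data as above; 'HW3' has no score, so it becomes NaN.
scores = pd.Series({"HW1": 90, "Exam 1": 77, "Project": 88, "HW2": 66},
                   index=["HW1", "HW2", "HW3", "Exam 1", "Project"])

n_missing = pd.isnull(scores).sum()    # each True counts as 1
graded = scores[pd.notnull(scores)]    # keep only the non-missing entries

print(n_missing)
print(graded)
```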

#### We can also change the indices:

In [None]:
mySeries1

In [None]:
mySeries1.index = ['a', 'x', 'b', 'z']
mySeries1

### Series Indexing

In [None]:
mySeries5 = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
mySeries5

In [None]:
# We can slice by row position (the stop position is excluded):
mySeries5.iloc[1:3]

In [None]:
# We can also slice by index label (the end label 'd' IS included):
mySeries5['b':'d']
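
The two slicing styles treat the end of the range differently, which is easy to miss. A short sketch of the difference, using the same values under the hypothetical name `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

by_position = s.iloc[1:3]    # positions 1 and 2; the stop position is excluded
by_label = s.loc["b":"d"]    # labels 'b', 'c' and 'd'; the stop label is included

print(by_position)
print(by_label)
```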

## DataFrame

A Pandas "DataFrame" represents a table of data.

Each column in a Pandas DataFrame can contain a different type of data.

*This example comes from Wes McKinney's book.*

In [None]:
# Suppose we already have some data in the form of a dictionary:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [None]:
# Convert this to a pandas DataFrame:
frame1 = pd.DataFrame(data)
frame1

In [None]:
# Look at the first 5 rows:
frame1.head()

In [None]:
# Look at the last 5 rows:
frame1.tail()

With `head()` or `tail()`, the default number of rows shown is 5.

Pass an integer to change this, for example `head(10)` or `tail(10)`.

**Checking the column data types**

In [None]:
# Check the column data types using the dtypes attribute
frame1.dtypes

In [None]:
# Use the shape attribute to get the number of rows and columns in your dataframe
frame1.shape

How would you print just the number of rows in your dataframe?
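
One possible answer, as a sketch: `shape` is a `(rows, columns)` tuple, so its first element is the row count, and `len()` of a DataFrame gives the same number. The name `frame` is used here only to keep the example self-contained.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
})

n_rows, n_cols = frame.shape   # unpack the (rows, columns) tuple
print(n_rows)
print(len(frame))              # len() also returns the number of rows
```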

In [None]:
# The info method gives the column datatypes + number of non-null values
frame1.info()

In [None]:
frame1

In [None]:
# We can specify the order in which columns are displayed:
pd.DataFrame(data, columns = ['year', 'state', 'pop'])

In [None]:
# Let's create another dataframe.
# We've added a new column (debt).
# We've also specified the index names.
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'],
                      index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2

In [None]:
# Assign a scalar value to all rows in a given column:
frame2['debt'] = 16.5
frame2

In [None]:
# Assign the values of a column via a list or array:
frame2['debt'] = np.arange(6)
frame2

In [None]:
# Neither assignment works: the array length must match the number of rows in frame2 (6).
for n in (7, 3):
    try:
        frame2['debt'] = np.arange(n)
    except ValueError as err:
        print(err)

In [None]:
# However, if we assign a pandas Series to a DataFrame column, pandas will fill in the gaps with NaN:
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
frame2

In [None]:
# Add a new column:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

In [None]:
# Remove a column:
del frame2['eastern']
frame2

**How can you delete multiple columns?**

Use the `drop()` method with `axis = 1`.

In [None]:
frame2.drop(['pop', 'debt'], axis = 1)

By default `axis = 0` in `drop()`. This default behavior will delete rows.

In [None]:
# Deleting rows
frame2.drop(['one','two'])

Are rows removed from `frame2`?
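
A small sketch to check this, using a made-up three-row frame: `drop()` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name if it should be kept.

```python
import pandas as pd

# Hypothetical small frame, only for illustration.
frame = pd.DataFrame({"year": [2000, 2001, 2002],
                      "state": ["Ohio", "Ohio", "Nevada"]},
                     index=["one", "two", "three"])

trimmed = frame.drop(["one", "two"])   # a new DataFrame without those rows

print(frame.shape)     # the original still has all of its rows
print(trimmed.shape)
```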

**How do you create a copy of a DataFrame?**

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# Convert data to a pandas DataFrame:
frame1 = pd.DataFrame(data)
frame1

In [None]:
frame1_copy = frame1
frame1_copy

In [None]:
# Remove a column from the copied dataframe:
del frame1_copy['pop']
frame1_copy

In [None]:
# Take a look at the original dataframe
frame1

What do you observe?
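
Plain assignment only gives the same DataFrame a second name, so a change made through either name shows up in both. For an independent copy, `copy()` can be used. A minimal sketch with made-up data and hypothetical variable names:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

alias = frame                # same object under a second name
independent = frame.copy()   # separate object with its own data

del independent["pop"]       # only the copy loses the column

print(alias is frame)
print(list(frame.columns))
print(list(independent.columns))
```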

### DataFrame Indexing

In [None]:
# Let's consider frame2
frame2 = pd.DataFrame(data, columns = ['year', 'state', 'pop', 'debt'],
                      index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2

In [None]:
# Get the column labels (returned as an Index object):
frame2.columns

In [None]:
# Retrieving a specific column:
frame2['year']

In [None]:
# Retrieving multiple specific columns:
frame2[['year', 'pop']]

In [None]:
frame2

In [None]:
# Retrieving a specific row:
# a) by row index name, using "loc"
frame2.loc['one']

In [None]:
frame2.loc['one':'four']   # Note that 'four' is included

In [None]:
frame2.loc[['one', 'four']]

In [None]:
# b) by integer row position, using "iloc"
frame2.iloc[0]

In [None]:
frame2.iloc[0:3]   # Note that 'four' is NOT included

In [None]:
frame2.iloc[[0, 3]]

In [None]:
# Select a subset of rows and columns:
frame2.loc['one', ['year', 'pop']]

In [None]:
frame2.loc['one', 'year':'pop']

In [None]:
frame2.loc['one':'three', 'year':'pop']
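
The same subset can also be selected by position. As a sketch, `iloc` accepts a row slice and a column slice together; unlike `loc`, both stop positions are excluded. The name `frame` is used only to keep the example self-contained.

```python
import pandas as pd

data = {"year": [2000, 2001, 2002, 2001, 2002, 2003],
        "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data, index=["one", "two", "three", "four", "five", "six"])

by_label = frame.loc["one":"three", "year":"pop"]   # end labels are included
by_position = frame.iloc[0:3, 0:3]                  # stop positions are excluded

print(by_label.equals(by_position))
```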

## Importing Data from .csv

#### First, suppose we have a .csv file named "car_financing.csv".

We need to upload `car_financing.csv` to the Colab working directory so that the notebook can read it.

To upload `car_financing.csv`, run the following code and select the file from your computer.

In [None]:
# Upload car_financing.csv
from google.colab import files
uploaded = files.upload()

In [None]:
# Use "read_csv()"
df = pd.read_csv('car_financing.csv')
df.head()
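
Outside Colab the `google.colab` module is not available, but `read_csv()` also accepts a file-like object, which makes it easy to try without a file on disk. The sketch below uses invented rows; only the column names `car_type` and `interest_rate` come from this notebook, and the real file may well contain more columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for car_financing.csv; the rows are invented.
csv_text = """car_type,interest_rate
Toyota Sienna,0.0702
Toyota Corolla,0.0359
Toyota Sienna,0.0359
"""

df = pd.read_csv(io.StringIO(csv_text))   # the first line becomes the header
print(df.head())
print(df.dtypes)
```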

## Filtering Data

Filter the data to keep only the rows with a `car_type` of 'Toyota Sienna' and an `interest_rate` of 0.0702.

In [None]:
# Let's start by looking at the car_type column.
# value_counts() is a Series method that counts how many times each unique value occurs.
df['car_type'].value_counts()

Filter for the `car_type`

In [None]:
# Notice that the filter produces a pandas series of True and False values
car_filter = df['car_type'] == 'Toyota Sienna'
car_filter

In [None]:
# Filter dataframe to get a new DataFrame of all columns, but only 'Toyota Sienna' rows.
sienna_df = df[car_filter]
display(sienna_df)
sienna_df['car_type'].value_counts()

Filter for the `interest_rate`

Comparison Operator | Meaning
--- | ---
< | less than
<= | less than or equal to
> | greater than
>= | greater than or equal to
== | equal
!= | not equal

In [None]:
# Create a filter for a specific interest rate
interest_filter = df['interest_rate'] == 0.0702
interest_filter

In [None]:
# Apply the filter
specificInterest_df = df[interest_filter]
# Only the rows with an interest rate of 0.0702 are kept in the result; df itself is unchanged.
specificInterest_df
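
Exact `==` comparison works when the stored float is the same value as the one being compared, as with a number read directly from a file. For values produced by arithmetic, rounding error can make exact equality fail. One possible alternative, shown here with made-up numbers, is `np.isclose()`, which compares within a small tolerance.

```python
import numpy as np
import pandas as pd

# Hypothetical values: the first one is computed, so it carries rounding error.
rates = pd.Series([0.1 + 0.2, 0.3, 0.5])

exact = rates == 0.3            # strict equality
close = np.isclose(rates, 0.3)  # equality within a small tolerance

print(exact.tolist())
print(close.tolist())
print(rates[close])             # the boolean array works as a filter
```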

### Combining Filters
In the previous sections, we created `car_filter` and `interest_filter` and applied them separately. We can also apply both at once.

Bitwise Logic Operator | Meaning
--- | ---
& | and
\| | or
~ | not

In [None]:
# Apply both filters to the DataFrame.
new_df = df[car_filter & interest_filter]
new_df
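
When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because `&`, `|` and `~` bind more tightly than `==`. A minimal sketch on invented rows (only the column names come from this notebook):

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "car_type": ["Toyota Sienna", "Toyota Corolla", "Toyota Sienna", "Toyota Corolla"],
    "interest_rate": [0.0702, 0.0702, 0.0359, 0.0359],
})

# and: both conditions must hold
both = df[(df["car_type"] == "Toyota Sienna") & (df["interest_rate"] == 0.0702)]
# or: at least one condition must hold
either = df[(df["car_type"] == "Toyota Sienna") | (df["interest_rate"] == 0.0702)]
# not: invert a condition
not_sienna = df[~(df["car_type"] == "Toyota Sienna")]

print(len(both), len(either), len(not_sienna))
```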