# 🐼 Pandas Notes
**Spring 2025 – Data Practice Journal**

This notebook contains notes and examples from working through the **Pandas** section of Codecademy's Analyze Financial Data with Python course.  
(Some code snippets and notes have been adapted from Codecademy for practice purposes.)

_Note: This notebook evolves over time as I continue to practice and build fluency._

In [1]:
import pandas as pd

In [2]:
import numpy as np

## DataFrames

### Create a DataFrame

A DataFrame is an object that stores data as rows and columns. It can be created manually or filled with data from a CSV, Excel spreadsheet, or SQL query.

DataFrames have rows and columns. Each column has a name, which is usually a string. Each row has an index label, which is an integer by default.

DataFrames may contain different data types such as strings, ints, floats, tuples, etc.

A dictionary can be passed into pd.DataFrame(). Each key is a column name and each value is a list of column values. The columns must all be the same length. Example:

In [4]:
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})

print(df1)

         name         address  age
0  John Smith    123 Main St.   34
1    Jane Doe  456 Maple Ave.   28
2   Joe Schmo    789 Broadway   51


Data can also be added using *lists*.

For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword `columns` to pass a list of column names.

In [5]:
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])

print(df2)

         name         address  age
0  John Smith    123 Main St.   34
1    Jane Doe  456 Maple Ave.   28
2   Joe Schmo    789 Broadway   51


## DataFrames: Loading and Saving CSVs

To load CSV data into a DataFrame in Pandas, use `pd.read_csv()`.

In [18]:
sample_data_frame = pd.read_csv('sample.csv')
# In this example, we read data from an existing CSV file into a variable called sample_data_frame.
# (sample.csv is not a real file, so this code will not run as-is.)

To save data to a CSV, use `.to_csv()`

In [12]:
df1.to_csv('new-csv-file.csv')
# In this example, df1 is the DataFrame object on which the .to_csv() method is called.
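
As a rough illustration of how the two calls fit together, here is a round trip that needs no file on disk. The `people` DataFrame and its values are made up for this sketch. It relies on `to_csv()` returning the CSV text when no path is given, and on `pd.read_csv()` accepting a file-like object such as `io.StringIO`. By default `to_csv()` also writes the row index as an extra unnamed column, so `index=False` is used here to leave it out.

```python
import io

import pandas as pd

# Hypothetical data, used only for this sketch
people = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe'],
    'age': [34, 28]
})

# With no path argument, to_csv() returns the CSV as a string.
# index=False keeps the row index out of the output.
csv_text = people.to_csv(index=False)
print(csv_text)

# read_csv() accepts a file path or a file-like object
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded)
```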

## Inspect a DataFrame
If a DataFrame is small, you can display it using `print(df)`.

If it's a larger DataFrame, it's helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method `.head()` displays the first 5 rows of a DataFrame. If you want to see more rows, you can pass in the positional argument `n`. For example, `df.head(10)` would show the first 10 rows.

The method **`df.info()`** prints a summary of each column: its name, its non-null count, and its data type.

In [13]:
print(df1.head())

         name         address  age
0  John Smith    123 Main St.   34
1    Jane Doe  456 Maple Ave.   28
2   Joe Schmo    789 Broadway   51


In [14]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     3 non-null      object
 1   address  3 non-null      object
 2   age      3 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
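
Since `df1` only has three rows, `.head()` cannot show much there. Below is a small sketch on a made-up 20-row DataFrame (the name `numbers` and its columns are assumptions for illustration) showing the default of 5 rows versus `head(10)`.

```python
import pandas as pd

# Hypothetical 20-row DataFrame for illustration
numbers = pd.DataFrame({
    'n': range(20),
    'square': [i ** 2 for i in range(20)]
})

first_five = numbers.head()    # default: first 5 rows
first_ten = numbers.head(10)   # first 10 rows

print(len(first_five), len(first_ten))  # 5 10
print(first_five)
```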


## Select Columns
Two possible syntaxes for selecting all values from a column:
1. Select the column as if you were selecting a value from a dictionary using a key. E.g., `customers['age']`
2. If the name of a column follows all the rules for a variable name (doesn't start w/ a number, doesn't contain spaces or special chars, etc.), then you can select it using *dot notation*:

In [None]:
df.MySecondColumn

In this example, we would type `customers.age`

When we select a single column, the result is called a *Series*.
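
A minimal sketch comparing the two syntaxes, assuming a made-up `customers` DataFrame (the names and ages are placeholders). Both forms should give back the same Series.

```python
import pandas as pd

# Hypothetical customers table for illustration
customers = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Amy Pond'],
    'age': [31, 27, 30]
})

ages_by_key = customers['age']  # bracket (dictionary-style) syntax
ages_by_dot = customers.age     # dot notation; works because 'age' is a valid variable name

print(type(ages_by_key).__name__)       # Series
print(ages_by_key.equals(ages_by_dot))  # True
```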

## Select Multiple Columns
Use double brackets and column names separated by commas:

In [None]:
new_df = orders[['last_name', 'email']]
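
To make the double-bracket behavior concrete, here is a sketch with a made-up `orders` DataFrame (the column values are placeholders). The inner brackets are a plain Python list of column names, and the result is a DataFrame even when that list holds only one name.

```python
import pandas as pd

# Hypothetical orders table for illustration
orders = pd.DataFrame({
    'id': [1, 2],
    'last_name': ['Jones', 'Tyler'],
    'email': ['mjones@example.com', 'rtyler@example.com']
})

new_df = orders[['last_name', 'email']]  # list of names -> DataFrame
one_column_df = orders[['email']]        # one-item list -> still a DataFrame
one_column_series = orders['email']      # single name -> Series

print(type(new_df).__name__, new_df.shape)  # DataFrame (2, 2)
print(type(one_column_df).__name__)         # DataFrame
print(type(one_column_series).__name__)     # Series
```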

## Select Rows
DataFrames are zero-indexed, which applies when selecting rows.
For example, if you wanted to select the third row from a DataFrame called orders, you would use the following:

In [None]:
orders.iloc[2]

## Select Multiple Rows

Several ways to select multiple rows:

`orders.iloc[3:7]` selects the rows at positions 3 through 6: it starts at index 3 (the fourth row) and goes up to but *not including* index 7

`orders.iloc[:4]` selects the first four rows: everything up to but *not including* index 4

`orders.iloc[-3:]` selects the last three rows: it starts at the 3rd-to-last row and goes up to and *including* the final row
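
The slices above can be checked on a small example. This sketch assumes a made-up 10-row `orders` DataFrame whose `order_id` values are placeholders; printing the index of each result shows which positions were picked.

```python
import pandas as pd

# Hypothetical 10-row orders table; index labels are 0 through 9
orders = pd.DataFrame({'order_id': range(100, 110)})

third_row = orders.iloc[2]      # a single row comes back as a Series
middle = orders.iloc[3:7]       # positions 3, 4, 5, 6
first_four = orders.iloc[:4]    # positions 0, 1, 2, 3
last_three = orders.iloc[-3:]   # positions 7, 8, 9

print(third_row['order_id'])    # 102
print(list(middle.index))       # [3, 4, 5, 6]
print(list(first_four.index))   # [0, 1, 2, 3]
print(list(last_three.index))   # [7, 8, 9]
```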

## Select Rows with Logic
You can select a subset of a DataFrame by using logical statements:
`df[df.MyColumnName == desired_column_value]`

If we wanted to select all rows where the customer's age is 30:
`df[df.age == 30]`

In Python `==` tests for equality. We can also use other logical statements such as:

Greater than, `>`
`df[df.age > 30]`

Less than, `<`
`df[df.age < 30]`

Not equal, `!=` -- This selects all rows where the customer's name is *not* `Clara Oswald`:
`df[df.name != 'Clara Oswald']`

You can also **combine multiple logical statements**.

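In pandas, `|` is the element-wise "or" and `&` is the element-wise "and". Wrap each condition in parentheses, because `|` and `&` bind more tightly than comparisons such as `<` and `==`.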

Example: to select all rows where the customer’s age was under 30 or the customer’s name was “Martha Jones”:

In [17]:
df[(df.age < 30) | 
   (df.name == 'Martha Jones')]

(`df` is a placeholder that is not defined in this notebook, so running this cell as-is raises `NameError: name 'df' is not defined`.)

Suppose we want to select the rows where the customer’s name is either “Martha Jones”, “Rose Tyler” or “Amy Pond”.

We could use the `.isin()` method to check that `df.name` is one of a list of values:

In [None]:
df[df.name.isin(['Martha Jones',
     'Rose Tyler',
     'Amy Pond'])]
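
Here is a runnable sketch of the selections in this section. The DataFrame below is made up (the ages are placeholders chosen for illustration), and it uses bracket notation for columns, which behaves the same as the dot notation in the notes.

```python
import pandas as pd

# Hypothetical customer data for illustration
df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Amy Pond', 'Clara Oswald'],
    'age': [31, 27, 30, 34]
})

exactly_30 = df[df['age'] == 30]

# "or": under 30, or named Martha Jones
under_30_or_martha = df[(df['age'] < 30) | (df['name'] == 'Martha Jones')]

# "and": over 28, and not named Clara Oswald
over_28_not_clara = df[(df['age'] > 28) & (df['name'] != 'Clara Oswald')]

# isin: name matches any value in the list
companions = df[df['name'].isin(['Martha Jones', 'Rose Tyler', 'Amy Pond'])]

print(list(exactly_30.index))          # [2]
print(list(under_30_or_martha.index))  # [0, 1]
print(list(over_28_not_clara.index))   # [0, 2]
print(list(companions.index))          # [0, 1, 2]
```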

## Setting Indices
Selecting a subset of a DataFrame using logic results in non-consecutive indices. This is inelegant, and because the index labels no longer match the row positions, it makes `.iloc[]` confusing to use.

This can be corrected using the method `.reset_index()`. Running this on a DataFrame will reset the indices and move the old indices to a column called `index`. If the old indices are not needed, you can use the keyword `drop=True` to avoid getting the extra column.

On a DataFrame called `df`, the full command would be:

`df.reset_index(drop=True)`

Using `reset_index()` returns a new DataFrame. To modify the existing DataFrame instead, use the keyword `inplace=True`.
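
A short sketch of the difference, assuming a made-up DataFrame (the ages are placeholders). Filtering leaves gaps in the index, and the two forms of `reset_index()` handle the old labels differently.

```python
import pandas as pd

# Hypothetical customer data for illustration
df = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Amy Pond', 'Clara Oswald'],
    'age': [31, 27, 30, 34]
})

over_29 = df[df['age'] > 29]                       # row 1 is filtered out
with_old_index = over_29.reset_index()             # old labels kept in an 'index' column
clean_index = over_29.reset_index(drop=True)       # old labels discarded

print(list(over_29.index))           # [0, 2, 3]
print(list(with_old_index.columns))  # ['index', 'name', 'age']
print(list(clean_index.index))       # [0, 1, 2]
```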

## Adding a Column
One way to add a new column is to assign a list with the same length as the existing DataFrame.

In [None]:
df['Quantity'] = [50, 120, 95, 60]

To add a new column that has the same value for all rows in the DataFrame:

In [None]:
df['In Stock?'] = True

You can also add a new column by performing an operation on the existing columns.

For example, you might want to add a column with sales tax for each item. The below multiplies each `Price` by `0.075`, the sales tax for this state:

In [None]:
df['Sales Tax'] = df.Price * 0.075
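
The three approaches can be tried together on a small table. The `inventory` DataFrame and its products and prices are assumptions made up for this sketch. The last step shows what happens when the list length does not match the number of rows: pandas raises a `ValueError`.

```python
import pandas as pd

# Hypothetical 4-row inventory table for illustration
inventory = pd.DataFrame({
    'Product': ['Hammer', 'Nails', 'Screwdriver', 'Tape'],
    'Price': [10.0, 20.0, 4.0, 8.0]
})

inventory['Quantity'] = [50, 120, 95, 60]            # one value per row
inventory['In Stock?'] = True                        # same value for every row
inventory['Sales Tax'] = inventory['Price'] * 0.075  # computed from another column

length_error_raised = False
try:
    inventory['Too Short'] = [1, 2, 3]               # 3 values for 4 rows
except ValueError as err:
    length_error_raised = True
    print('ValueError:', err)

print(inventory)
```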

## Performing Column Operations
Use the `apply` method to apply a function to every value in a particular column.

For example, this code overwrites the existing `Name` column by applying the function `str.upper` to every row in `Name`.

In [None]:
df['Name'] = df.Name.apply(str.upper)
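
A runnable sketch of the same idea, assuming a small made-up DataFrame. Besides `str.upper`, it passes the built-in `len` and a lambda to `apply`; the `Name Length` and `First Name` columns are hypothetical additions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'Name': ['john smith', 'jane doe', 'joe schmo']})

# Overwrite the column with an upper-cased version of each value
df['Name'] = df['Name'].apply(str.upper)

# Any function that takes one value works, including built-ins and lambdas
df['Name Length'] = df['Name'].apply(len)
df['First Name'] = df['Name'].apply(lambda full_name: full_name.split()[0])

print(df)
```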