# Pandas
<center><img src="../images/stock/pexels-introspectivedsgn-4065800.jpg" width="500"></center>

pandas offers powerful data structures and manipulation tools that simplify data cleaning and analysis in Python. 

It's frequently used alongside NumPy for numerical operations, SciPy and statsmodels for statistical analysis, and Matplotlib for data visualization. 

pandas adopts NumPy's array-oriented computing paradigm, prioritizing array functions and avoiding explicit loops for data processing.
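
To make the array-oriented idea concrete, here is a minimal sketch. The `prices` values are hypothetical and used only for illustration; the point is that a single expression operates on every element without a hand-written loop.

```python
import pandas as pd

# Hypothetical prices, used only for illustration
prices = pd.Series([10.0, 20.0, 30.0])

# Array-oriented: one expression applies to every element at once
discounted = prices * 0.9

# Equivalent explicit loop, shown only for comparison
discounted_loop = pd.Series([p * 0.9 for p in prices])

print(discounted.tolist())
```

Both approaches produce the same values, but the first is shorter and is the style pandas encourages.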


# Pandas Installation

You can install pandas with the following command:

```bash
!pip install pandas
```

However, just like with NumPy, I took the liberty of installing Pandas on the Jupyter Hub.

## Importing Pandas

You'll need to import pandas into your notebook or script before using it.

The standard convention for importing Pandas is as follows:

```python
import pandas as pd
```

Let's go ahead and import pandas into our notebook in the cell below:

In [1]:
# Import Pandas
import pandas as pd



# Pandas Data Structures
<center><img src="../images/stock/pexels-jeffrey-czum-254391-2346289.jpg" width="500"></center>

Pandas primarily uses two data structures:

* __Series__: 1-dimensional labeled array.
* __DataFrame__: 2-dimensional labeled table (collection of Series).

## Series

Think of a pandas Series as a single column of data with labels.

* __Default Labels__: By default, each item gets a number based on its position, just like in a Python list.
* __Custom Labels__: You can also give each item your own specific label (like a name or an ID). These labels can be numbers, text, or even combinations.
* __Data Types__: A Series can hold various types of data, but it's most efficient when all the items in a single Series are of the same type. This is important because, as we'll see later, Series become the columns in a DataFrame, and columns ideally have consistent data types.

__Series Illustration__

<center><img src="../images/illustrations/pandas-series-example.jpg"></center>

### Creating a Series from a List

We can easily create a pandas Series from a one-dimensional dataset, like a Python list, using the `pd.Series()` constructor.

For example, let's take this list and turn it into a Series:

In [4]:
# Data
popular_shows = [
    "Stranger Things",
    "The Mandalorian",
    "The Queen's Gambit",
    "Bridgerton",
    "Squid Game",
    "Succession",
    "Ted Lasso",
    "The Witcher",
    "Euphoria",
    "Ozark"
]

# Transform into a Series using pd.Series()
popular_shows_series = pd.Series(popular_shows)

# Output the Series
popular_shows_series

0       Stranger Things
1       The Mandalorian
2    The Queen's Gambit
3            Bridgerton
4            Squid Game
5            Succession
6             Ted Lasso
7           The Witcher
8              Euphoria
9                 Ozark
dtype: object

* Passing a list to `pd.Series()` creates a Series with automatic numeric labels (0, 1, 2, ...).
* The `.dtype` attribute tells us the data type of the elements within the Series.
* Text data (strings) are typically represented as the `object` dtype by default.
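
As a quick illustration of how `.dtype` depends on the contents, the sketch below builds two small, hypothetical Series: one holding only integers and one mixing types.

```python
import pandas as pd

# Hypothetical data: every item is an integer
numbers = pd.Series([1, 2, 3])

# Hypothetical data: items of different types in one Series
mixed = pd.Series([1, "two", 3.0])

print(numbers.dtype)  # an integer dtype
print(mixed.dtype)    # object
```

A Series with a single consistent type gets a specific numeric dtype, while mixed contents fall back to the general-purpose `object` dtype.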

### Custom Labels

You can specify your own labels for the Series using the `index = sequence` argument within the `pd.Series()` function. For example:

In [11]:
# Data
popular_movies = [
    "Oppenheimer",
    "Barbie",
    "The Godfather",
    "Parasite",
    "Spirited Away"
]

# Custom labels (note that "d" appears twice)
indices = ["a", "b", "c", "d", "d"]

# Create a Series with the custom index
movie_series = pd.Series(popular_movies,
                         index = indices)

# Output the Series
movie_series



a      Oppenheimer
b           Barbie
c    The Godfather
d         Parasite
d    Spirited Away
dtype: object

Now, the Series uses the custom labels we provided instead of the default numbers.

### Accessing Data - Index Operator `[]`

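Similar to Python lists, we can retrieve data from a Series using square brackets `[]`. With custom labels like ours, pass the label; for positional access, prefer `.iloc` (covered below), since recent pandas versions deprecate using `[]` with integer positions on a labeled Series.

For example: which movies are associated with the label `d`? Because `d` appears twice in our index, the lookup returns both matching rows as a Series rather than a single value.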

In [12]:
# Access the element(s) at label d
movie_series["d"]



d         Parasite
d    Spirited Away
dtype: object
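
To see how the result type depends on whether a label is unique, here is a small sketch. The Series `s` and its labels are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical Series: the label "y" is deliberately repeated
s = pd.Series(["first", "second", "third"], index=["x", "y", "y"])

unique_hit = s["x"]      # unique label: a single value
duplicate_hit = s["y"]   # repeated label: a Series with two rows

print(unique_hit)
print(duplicate_hit)
```

If you expect a single value back, it helps to keep index labels unique.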

### Accessing Data - `.loc` Accessor

Another way to access Series data by its label is using the `.loc` attribute. For example:


In [13]:
# Access the element(s) at label d
movie_series.loc["d"]



d         Parasite
d    Spirited Away
dtype: object

### Accessing Data - `.iloc` Accessor

Even with custom labels, you can still access elements by their numerical position (like in a list) using the `.iloc` attribute. 

For example:

In [14]:
# Access the element at position 3
movie_series.iloc[3]



'Parasite'

### Slicing Series Data

We can select multiple elements from a Series using slicing with the index operator `[]`, `.loc` (label-based slicing), and `.iloc` (position-based slicing). 

Here's how:

In [15]:
# Demonstrate [] slicing
movie_series["b":"d"]



b           Barbie
c    The Godfather
d         Parasite
d    Spirited Away
dtype: object

In [16]:
# Demonstrate .loc slicing
movie_series.loc["b":"d"]



b           Barbie
c    The Godfather
d         Parasite
d    Spirited Away
dtype: object

In [17]:
# Demonstrate .iloc slicing
movie_series.iloc[1:4]



b           Barbie
c    The Godfather
d         Parasite
dtype: object
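
Notice in the outputs above that label-based slices include the end label, while position-based slices stop before the end position. The sketch below shows the same behavior on a small hypothetical Series with unique labels.

```python
import pandas as pd

# Hypothetical Series with unique labels
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["b":"c"]    # label slice: includes the end label "c"
by_position = s.iloc[1:2]    # position slice: excludes position 2

print(by_label.tolist())     # [20, 30]
print(by_position.tolist())  # [20]
```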

## DataFrame

<center><img src="../images/stock/pexels-suki-lee-110686949-16200703.jpg" width="500"></center>

A pandas DataFrame is like a table with rows and columns. It's a 2D structure where each column can hold different types of data. Think of it as a collection of Series, all sharing the same row labels. Each column is essentially a Series.

__DataFrame Illustration__

<center><img src="../images/illustrations/pandas_dataframe_example.jpg"></center>
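
As a small check of the "collection of Series" idea, the sketch below selects one column from a hypothetical two-column DataFrame and confirms that it is a Series sharing the DataFrame's row labels.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [31, 27]})

# Selecting a single column returns a Series
age_column = df["age"]

print(type(age_column))
print(age_column.index.equals(df.index))  # True: same row labels
```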

### Creating a DataFrame

##### `pd.DataFrame()`

We use the `pd.DataFrame()` function to create a pandas DataFrame.

In the code cell below, we will create a DataFrame from a given dictionary:

In [23]:
# Synthetic data
data = {
    'Name': ['TechGuru', 'FashionDiva', 'GameMaster', 'FoodieFun', 'TravelBug', 'MusicMania', 'BeautyQueen', 'DIYExpert', 'SportsFan', 'ComedyKing'],
    'Subscribers': [1500000, 2300000, 1800000, 1200000, 950000, 2700000, 1100000, 1600000, 2000000, 1400000],
    'Views': [120000000, 250000000, 180000000, 90000000, 60000000, 300000000, 80000000, 140000000, 220000000, 100000000],
    'Category': ['Tech', 'Fashion', 'Gaming', 'Food', 'Travel', 'Music', 'Beauty', 'DIY', 'Sports', 'Comedy'],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Australia', 'USA', 'USA', 'Canada', 'USA', 'UK'],
    'DateStarted': ['2021-01-01', '2020-05-15', '2019-11-01', '2022-03-10', '2018-09-20', '2023-02-01', '2021-07-01', '2020-10-01', '2019-04-01', '2022-01-01']
}

# Create the DataFrame
youtube_df = pd.DataFrame(data)

# Output the DataFrame
youtube_df

Unnamed: 0,Name,Subscribers,Views,Category,Country,DateStarted
0,TechGuru,1500000,120000000,Tech,USA,2021-01-01
1,FashionDiva,2300000,250000000,Fashion,Canada,2020-05-15
2,GameMaster,1800000,180000000,Gaming,UK,2019-11-01
3,FoodieFun,1200000,90000000,Food,USA,2022-03-10
4,TravelBug,950000,60000000,Travel,Australia,2018-09-20
5,MusicMania,2700000,300000000,Music,USA,2023-02-01
6,BeautyQueen,1100000,80000000,Beauty,USA,2021-07-01
7,DIYExpert,1600000,140000000,DIY,Canada,2020-10-01
8,SportsFan,2000000,220000000,Sports,USA,2019-04-01
9,ComedyKing,1400000,100000000,Comedy,UK,2022-01-01


__Note:__

* Jupyter has a neat feature where if the last thing in a cell is a DataFrame, it'll display as an HTML table without needing `print()`, which gives you a cleaner look than the standard text output.

### Initial DataFrame Inspection

Pandas makes it easy to get a quick understanding of your DataFrame. We'll introduce four fundamental methods for this:

* `df.head()`: to see the beginning of your data.
* `df.tail()`: to see the end of your data.
* `df.describe()`: for a statistical summary of numerical columns.
* `df.info()`: to check data types and non-null values.


#### `df.head()`

The `df.head()` method returns the first five rows of the DataFrame by default. Pass an integer, as in `df.head(3)`, to choose a different number of rows.

In [25]:
# Demonstration
youtube_df.head(3)




Unnamed: 0,Name,Subscribers,Views,Category,Country,DateStarted
0,TechGuru,1500000,120000000,Tech,USA,2021-01-01
1,FashionDiva,2300000,250000000,Fashion,Canada,2020-05-15
2,GameMaster,1800000,180000000,Gaming,UK,2019-11-01


#### `df.tail()`

The `df.tail()` method returns the last five rows of the DataFrame by default. Pass an integer, as in `df.tail(2)`, to choose a different number of rows.

In [26]:
# Demonstration
youtube_df.tail(2)



Unnamed: 0,Name,Subscribers,Views,Category,Country,DateStarted
8,SportsFan,2000000,220000000,Sports,USA,2019-04-01
9,ComedyKing,1400000,100000000,Comedy,UK,2022-01-01


#### `df.describe()`

The `df.describe()` method is a powerful tool for quickly understanding the distribution of your numerical data within a DataFrame. When called, it computes and summarizes several key statistical measures for each numerical column:

* __count__: The number of non-missing (non-NaN) values.
* __mean__: The average value.
* __std__: The standard deviation, a measure of the spread or dispersion of the data.
* __min__: The minimum value.
* __max__: The maximum value.
* __25% (Q1)__: The first quartile, meaning 25% of the data falls below this value.
* __50% (Median or Q2)__: The middle value; 50% of the data is below and 50% is above.
* __75% (Q3)__: The third quartile, meaning 75% of the data falls below this value.

This output provides a concise overview of the central tendency, dispersion, and shape of the numerical data in your DataFrame.

In [27]:
# Demonstration
youtube_df.describe()




Unnamed: 0,Subscribers,Views
count,10.0,10.0
mean,1655000.0,154000000.0
std,552996.9,80443220.0
min,950000.0,60000000.0
25%,1250000.0,92500000.0
50%,1550000.0,130000000.0
75%,1950000.0,210000000.0
max,2700000.0,300000000.0


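#### `df.info()`

The `df.info()` method prints a concise summary of the DataFrame: the index range, each column's name, non-null count, and dtype, plus memory usage.

In [28]: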
# .info()
youtube_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         10 non-null     object
 1   Subscribers  10 non-null     int64 
 2   Views        10 non-null     int64 
 3   Category     10 non-null     object
 4   Country      10 non-null     object
 5   DateStarted  10 non-null     object
dtypes: int64(2), object(4)
memory usage: 612.0+ bytes


### More DataFrame Creation Techniques

Beyond Python dictionaries and lists, you can create Pandas DataFrames in several other ways:

* __From Series__: Combining one or more Series.
* __From Files__: Reading CSV, Excel, and other formats.
* __From APIs__: Retrieving data from web sources.
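
As an example of the first option, here is a minimal sketch that combines two Series into a DataFrame. The names `titles`, `seasons`, and `shows_df`, along with their values, are hypothetical.

```python
import pandas as pd

# Two hypothetical Series that share the default 0, 1, 2 labels
titles = pd.Series(["Show A", "Show B", "Show C"])
seasons = pd.Series([4, 3, 1])

# Each dictionary key becomes a column name; each Series becomes a column
shows_df = pd.DataFrame({"Title": titles, "Seasons": seasons})

print(shows_df)
```

The Series are aligned on their labels, so rows with the same label end up side by side.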

### Reading Data from Files
<center><img src="../images/stock/pexels-eva-bronzini-6068493.jpg" width="500"></center>

Pandas provides powerful and convenient functions to import data from various external file formats directly into DataFrames. This section will introduce you to these essential tools for bringing your data into the Pandas environment.

#### Reading CSV Files with `pd.read_csv()`

The `pd.read_csv()` function is your primary tool in Pandas for reading data stored in Comma Separated Values (CSV) files. This function is incredibly versatile and can handle a wide variety of CSV file structures. Let's explore its basic usage.

##### Reading CSV files from the Web
Pandas makes it incredibly convenient to read CSV files not only from your local computer but also directly from web URLs. This is particularly useful when working with publicly available datasets.

Here is the general syntax:

```python
import pandas as pd

URL = "YOUR_CSV_FILE_URL_HERE"
df = pd.read_csv(URL)
```

Explanation:

1. We begin by importing the Pandas library as pd.
2. You will replace `"YOUR_CSV_FILE_URL_HERE"` with the specific web address of the CSV file you want to read. This URL is stored in a variable.
3. The `pd.read_csv(URL)` function is then called, which performs the following actions:
    1. __Fetches the data__: Pandas sends a request to the URL and retrieves the content of the CSV file.
    2. __Parses the data__: It interprets the comma-separated values and organizes them into a tabular structure.
    3. __Creates a DataFrame__: The parsed data is automatically loaded into a Pandas DataFrame, which we have named `df` in this example.
4. Once the data is in a DataFrame, you can use standard Pandas methods like `df.head()` to see the initial rows and `df.info()` to understand its structure (number of rows, columns, data types, and non-null values).

This method streamlines the process of working with online CSV datasets, allowing you to quickly load and begin analyzing data directly from the web.
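
If you want to see the parsing step on its own, without a network connection, the sketch below feeds `pd.read_csv()` a small in-memory CSV through `io.StringIO`. The CSV text is made up for illustration; reading from a URL goes through the same parsing once the content is fetched.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded file
csv_text = "city,temp_c\nPortland,18.5\nSeattle,16.0\n"

# read_csv accepts file-like objects as well as paths and URLs
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```

The first line of the text becomes the column names, and pandas infers a floating-point dtype for `temp_c`.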

#### Example: Reading Nike BikeTown Data from a URL
<center><img src="../images/generated/gemini_generated_panda_bike.jpeg" width="400"></center>

Now, let's put this into practice with some publicly available data from Nike BikeTown. We can directly read a CSV file containing trip data using the `pd.read_csv()` function and the file's web address.


Here's the URL for the BikeTown data:

```python
URL = "https://s3.amazonaws.com/biketown-tripdata-public/2018_05.csv"
```

For more BikeTown data, visit - [BikeTown - System Data](https://biketownpdx.com/system-data)

##### Importing and Inspecting the Data

Given the URL, let's do the following:

1. Read the BikeTown data from the provided URL into a DataFrame called `biketown_df`.
2. Use the `.head()` method to display the first 5 rows of the DataFrame. What information do you think each column represents?

In [29]:
# Demonstration
URL = "https://s3.amazonaws.com/biketown-tripdata-public/2018_05.csv"

biketown_df = pd.read_csv(URL)


In [30]:
biketown_df.head()

Unnamed: 0,RouteID,PaymentPlan,StartHub,StartLatitude,StartLongitude,StartDate,StartTime,EndHub,EndLatitude,EndLongitude,EndDate,EndTime,TripType,BikeID,BikeName,Distance_Miles,Duration,RentalAccessPath,MultipleRental
0,6624288,Subscriber,NW 13th at Marshall,45.530804,-122.684423,5/1/2018,0:06,NW 2nd at Everett,45.525367,-122.672546,5/1/2018,0:11,,6503,0326 BIKETOWN,0.86,0:05:49,keypad,False
1,6624313,Subscriber,NW Johnson at Jamison Square,45.528637,-122.682019,5/1/2018,0:11,,45.526398,-122.689363,5/1/2018,0:16,,6162,0874 BIKETOWN,0.51,0:04:32,keypad_rfid_card,False
2,6624387,Subscriber,NW Marshall at Tanner Springs Park,45.530831,-122.681596,5/1/2018,0:24,NW Couch at 11th,45.523742,-122.681813,5/1/2018,0:27,,6535,0817 BIKETOWN,0.75,0:03:28,keypad,False
3,6624410,Subscriber,,45.516459,-122.630957,5/1/2018,0:28,,45.516527,-122.622968,5/1/2018,0:30,,6179,0503 BIKETOWN,0.35,0:02:29,keypad,False
4,6624448,Casual,,45.529077,-122.654351,5/1/2018,0:36,NE 11th at Holladay Park,45.530279,-122.654669,5/1/2018,0:34,,6548,0591 BIKETOWN,4.38,,keypad,False


3. Use the `.info()` method to get a concise summary of the DataFrame, including data types and non-null values.

In [31]:
# Demonstration
biketown_df.info()




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79399 entries, 0 to 79398
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   RouteID           79399 non-null  int64  
 1   PaymentPlan       79399 non-null  object 
 2   StartHub          53666 non-null  object 
 3   StartLatitude     79377 non-null  float64
 4   StartLongitude    79377 non-null  float64
 5   StartDate         79399 non-null  object 
 6   StartTime         79399 non-null  object 
 7   EndHub            61560 non-null  object 
 8   EndLatitude       79377 non-null  float64
 9   EndLongitude      79377 non-null  float64
 10  EndDate           79399 non-null  object 
 11  EndTime           79399 non-null  object 
 12  TripType          47 non-null     object 
 13  BikeID            79399 non-null  int64  
 14  BikeName          79399 non-null  object 
 15  Distance_Miles    79399 non-null  float64
 16  Duration          78420 non-null  object

4. Use the `.shape` attribute to find the number of rows and columns in the DataFrame.

In [32]:
# Demonstration
biketown_df.shape




(79399, 19)

In [34]:
# .describe
biketown_df.describe()

Unnamed: 0,RouteID,StartLatitude,StartLongitude,EndLatitude,EndLongitude,BikeID,Distance_Miles
count,79399.0,79377.0,79377.0,79377.0,79377.0,79399.0,79399.0
mean,6974648.0,45.522341,-122.671874,45.522151,-122.671563,6852.502173,2.06869
std,175930.1,0.012608,0.017123,0.012421,0.016824,1606.305738,18.721801
min,6624288.0,45.353495,-122.81616,45.353495,-122.854652,5986.0,0.0
25%,6844826.0,45.51547,-122.68243,45.515568,-122.682019,6301.0,0.76
50%,6991522.0,45.52196,-122.673892,45.521895,-122.673892,6564.0,1.44
75%,7132083.0,45.528637,-122.663338,45.528637,-122.663338,7149.5,2.72
max,7269113.0,45.629674,-122.477075,45.629674,-122.479978,21157.0,5251.59


##### Selecting and Inspecting Columns

1. Select the `StartHub` column and display the first 10 unique values using the `.unique()` method. What are some of the starting bike station names?

2. Select the `Distance_Miles` column. Calculate and display the minimum and maximum values of the distance column using `.min()` and `.max()`.

In [35]:
# Demonstration
distance_max = biketown_df["Distance_Miles"].max()
distance_min = biketown_df["Distance_Miles"].min()

print(f"Maximum Distance: {distance_max}")
print(f"Minimum Distance: {distance_min}")

Maximum Distance: 5251.59
Minimum Distance: 0.0
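
The first exercise above asks for unique values with `.unique()`. Here is one possible sketch, run on a tiny hypothetical `trips` DataFrame so that it works without downloading the BikeTown file; the hub names are invented. The same pattern should apply to `biketown_df["StartHub"]`.

```python
import pandas as pd

# Hypothetical stand-in for the BikeTown data, including a missing hub
trips = pd.DataFrame({"StartHub": ["Hub A", "Hub B", "Hub A", None, "Hub C"]})

# Drop missing values, collect the unique names, and keep at most ten
first_hubs = trips["StartHub"].dropna().unique()[:10]

print(list(first_hubs))
```

`.unique()` returns values in order of first appearance, so slicing with `[:10]` gives the first ten distinct names encountered.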


##### Descriptive Statistics

* Use the `describe()` method on the `Distance_Miles` column to get summary statistics like mean, standard deviation, min, max, and quartiles. 

In [33]:
# Demonstration
biketown_df["Distance_Miles"].describe()

count    79399.000000
mean         2.068690
std         18.721801
min          0.000000
25%          0.760000
50%          1.440000
75%          2.720000
max       5251.590000
Name: Distance_Miles, dtype: float64

##### Value Counts

* Find the top 5 most frequent starting stations using the `.value_counts()` method on the `StartHub` column.
* Find the number of unique `BikeID` values in the DataFrame using `.nunique()`. This tells you how many individual bikes were used in May 2018.
* Examine the `PaymentPlan` column using `.value_counts()`. What are the different payment plans and their counts?

In [41]:
# Demonstration
biketown_df["StartHub"].value_counts().head()

StartHub
SW Salmon at Waterfront Park        2328
SW River at Montgomery              1413
SW Moody at Aerial Tram Terminal    1239
SW Naito at Ankeny Plaza            1201
SW Naito at Morrison                1019
Name: count, dtype: int64
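
The other two exercises follow the same pattern. The sketch below uses a small hypothetical `trips` DataFrame (the values are invented, not taken from the real file) to show `.nunique()` and `.value_counts()` side by side.

```python
import pandas as pd

# Hypothetical stand-in for the BikeTown data
trips = pd.DataFrame({
    "BikeID": [6503, 6162, 6503, 6179],
    "PaymentPlan": ["Subscriber", "Casual", "Subscriber", "Subscriber"],
})

# Number of distinct bikes
n_bikes = trips["BikeID"].nunique()

# How often each payment plan appears, most frequent first
plan_counts = trips["PaymentPlan"].value_counts()

print(n_bikes)
print(plan_counts)
```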

#### Beyond CSV: Other File Reading Methods

Pandas also provides functions for reading other common file types:

* `pd.read_excel()`: For reading data from Excel files (.xlsx, .xls).
* `pd.read_json()`: For reading data from JSON (JavaScript Object Notation) files.
* `pd.read_table()`: For reading delimited text files (similar to CSV but with customizable delimiters).
* `pd.read_parquet()`: For reading data from Parquet files, a columnar storage format.
* `pd.read_pickle()`: For reading serialized Python objects.
* `pd.read_sql()`: For reading data from SQL databases. 

### Reading Data from an API

Want to know how Tesla's been performing? Let's grab their stock data from the past year using the Yahoo Finance API and the `yfinance` library. This will give us a DataFrame to analyze.

The `yfinance` library has already been installed on our server. If you are working elsewhere, you can install it with:

```bash
!pip install yfinance
```

In [None]:
# Install yfinance 
!pip install yfinance

Next, let's import the necessary yfinance library:

```python
import yfinance as yf
```

In [39]:
# Import yfinance
import pandas as pd
import yfinance as yf

#### Import and Inspect the Data

Now, we'll use the `yf.Ticker` class and its `.history()` method to fetch one year of Tesla's stock data.

The basic format is:

```python
data = yf.Ticker(SYMBOL).history(period="1y")
```

For Tesla, the ticker symbol is `TSLA`.

For more details on the yfinance API, you can check out the official documentation: [The yfinance API Reference](https://ranaroussi.github.io/yfinance/)

* Let's get the data using `yf.Ticker()` and `.history()`, then inspect it.

In [40]:
# Demonstration
TSLA = yf.Ticker("TSLA").history(period="1y")

TSLA

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2024-10-03 00:00:00-04:00,244.479996,249.789993,237.809998,240.660004,80729200,0.0,0.0
2024-10-04 00:00:00-04:00,246.690002,250.960007,244.580002,250.080002,86573200,0.0,0.0
2024-10-07 00:00:00-04:00,249.000000,249.830002,240.699997,240.830002,68113300,0.0,0.0
2024-10-08 00:00:00-04:00,243.559998,246.210007,240.559998,244.500000,56303200,0.0,0.0
2024-10-09 00:00:00-04:00,243.820007,247.429993,239.509995,241.050003,66289500,0.0,0.0
...,...,...,...,...,...,...,...
2025-09-26 00:00:00-04:00,428.299988,440.470001,421.019989,440.399994,101628200,0.0,0.0
2025-09-29 00:00:00-04:00,444.350006,450.980011,439.500000,443.209991,79491500,0.0,0.0
2025-09-30 00:00:00-04:00,441.519989,445.000000,433.119995,444.720001,74358000,0.0,0.0
2025-10-01 00:00:00-04:00,443.799988,462.290009,440.750000,459.459991,98122300,0.0,0.0


The result of our API call is a pandas DataFrame, so we can apply `.info()`, `.head()`, `.tail()`, and `.describe()` to it.

Actually...let's do that.

Let's start with `.head()` in the cell below to preview the data:

In [None]:
# Demonstration

TSLA.head()

And now let's get a statistical summary of the dataset in the cell below. 

Which method is that again?

In [None]:
# Demonstration
TSLA.describe()

Are there any insights that we can gather from the `Close` column?

Some context - the `Close` column indicates the stock price of a publicly traded company at the end of a trading day. These values can fluctuate based on the company's performance or overall market sentiment, which can be positive or negative.

Based on the maximum and minimum closing stock prices, what can we infer about the company's performance over the observed period?
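
One way to approach this question is to pull out the lowest and highest closing prices along with the dates on which they occurred. The sketch below uses a small hypothetical `prices` DataFrame with invented values, not real Tesla data; the same calls should work on `TSLA["Close"]`.

```python
import pandas as pd

# Hypothetical closing prices indexed by date (not real market data)
prices = pd.DataFrame(
    {"Close": [240.0, 250.5, 221.3, 310.8]},
    index=pd.to_datetime(["2024-10-03", "2024-10-04", "2024-10-07", "2024-10-08"]),
)

lowest = prices["Close"].min()
highest = prices["Close"].max()

# idxmin / idxmax return the index label (here, the date) of each extreme
lowest_day = prices["Close"].idxmin()
highest_day = prices["Close"].idxmax()

print(f"Lowest close:  {lowest} on {lowest_day.date()}")
print(f"Highest close: {highest} on {highest_day.date()}")
```

Knowing when the extremes happened helps distinguish a steady climb from a dip followed by a recovery.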

## Conclusion

Throughout this session, you've gained a foundational understanding of:

* __Pandas:__ Your go-to library for data manipulation and analysis in Python.
* __DataFrames:__ The primary data structure in Pandas for handling tabular data.
* __Series:__ The one-dimensional building block of DataFrames.
* __Loading Data:__ Importing external datasets into your Jupyter Notebook.
* __Initial Data Exploration:__ Methods like `.head()`, `.info()`, and `.describe()` are now tools you are ready to use for quickly understanding your datasets.

By now, you should feel comfortable loading a dataset and performing a quick initial assessment to get a feel for its contents.