# pandas
<center><img src="../images/stock/pexels-introspectivedsgn-4065800.jpg" width="500"></center>

pandas offers powerful data structures and manipulation tools that simplify data cleaning and analysis in Python. 

It's frequently used alongside NumPy for numerical operations, SciPy and statsmodels for statistical analysis, and Matplotlib for data visualization. 

pandas adopts NumPy's array-oriented computing paradigm, prioritizing array functions and avoiding explicit loops for data processing.
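
As a rough illustration of the array-oriented style described above, the sketch below doubles every value in a Series twice: once with an explicit loop and once with a single expression. The `prices` values are made up for this example.

```python
import pandas as pd

# Hypothetical values, used only for illustration
prices = pd.Series([10.0, 20.0, 30.0])

# Loop-based approach: handle one element at a time
looped = pd.Series([p * 2 for p in prices])

# Array-oriented approach: one expression applied to the whole Series
vectorized = prices * 2

# Both approaches produce the same values
print(vectorized.equals(looped))
```

The second form is shorter and lets pandas do the iteration internally, which is the style used throughout this notebook.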


# Pandas Installation

You can install pandas by running the following command in a notebook cell (the leading `!` tells Jupyter to run it as a shell command):

```bash
!pip install pandas
```

However, just like with NumPy, I took the liberty of installing Pandas on the Jupyter Hub.

## Importing Pandas

You'll need to import pandas into your notebook or script before using it.

The standard convention for importing Pandas is as follows:

```python
import pandas as pd
```

Let's go ahead and import pandas into our notebooks in the cell below:

In [None]:
# Import Pandas
import pandas as pd
import random


# Pandas Data Structures
<center><img src="../images/stock/pexels-jeffrey-czum-254391-2346289.jpg" width="500"></center>

Pandas primarily uses two data structures:

* __Series__: 1-dimensional labeled array.
* __DataFrame__: 2-dimensional labeled table (collection of Series).

## Series

Think of a pandas Series as a single column of data with labels.

* __Default Labels__: By default, each item gets a number based on its position, just like in a Python list.
* __Custom Labels__: You can also give each item your own specific label (like a name or an ID). These labels can be numbers, text, or even combinations.
* __Data Types__: A Series can hold various types of data, but it's most efficient when all the items in a single Series are of the same type. This is important because, as we'll see later, Series become the columns in a DataFrame, and columns ideally have consistent data types.

### Creating a Series from a List

We can easily create a pandas Series from a one-dimensional dataset, like a Python list, using the `pd.Series()` constructor.

For example, let's take this list and turn it into a Series:

In [None]:
# Data
popular_shows = [
    "Stranger Things",
    "The Mandalorian",
    "The Queen's Gambit",
    "Bridgerton",
    "Squid Game",
    "Succession",
    "Ted Lasso",
    "The Witcher",
    "Euphoria",
    "Ozark"
]

# Transform into a Series using pd.Series()


# Output the Series



* Passing a list to `pd.Series()` creates a Series with automatic numeric labels (0, 1, 2, ...).
* The `.dtype` attribute tells us the data type of the elements within the Series.
* Text data (strings) are typically represented as the `object` dtype by default.
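
A minimal sketch of the points above, using a shortened version of the shows list. The variable names `shows` and `shows_series` are chosen here for illustration only.

```python
import pandas as pd

# A shortened list of shows, for illustration
shows = ["Stranger Things", "Ted Lasso", "Ozark"]

# Passing a list to pd.Series() builds a Series with default labels
shows_series = pd.Series(shows)

print(shows_series)                 # values alongside their labels
print(shows_series.index.tolist())  # the automatic labels: 0, 1, 2
print(shows_series.dtype)           # the dtype pandas chose for the text data
```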

### Custom Labels

You can specify your own labels for the Series using the `index` argument of the `pd.Series()` function. For example:

In [None]:
# Data
popular_movies = [
    "Oppenheimer",
    "Barbie",
    "The Godfather",
    "Parasite",
    "Spirited Away"
]

# Transform into a Series with Custom Indices
indices = list(range(100,105))





Now, the Series uses the custom labels we provided instead of the default numbers.
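
One way this could look, sketched with a shortened movie list and hypothetical labels starting at 100:

```python
import pandas as pd

# A shortened list of movies, for illustration
movies = ["Oppenheimer", "Barbie", "The Godfather"]

# Custom labels: one label per item, passed through the index argument
labels = [100, 101, 102]
movies_series = pd.Series(movies, index=labels)

print(movies_series)
print(movies_series.index.tolist())  # the custom labels replace 0, 1, 2
```

The number of labels has to match the number of items, otherwise pandas raises an error.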

### Accessing Data - Index Operator `[]`

Similar to Python lists, we can retrieve data from a Series using square brackets `[]`. With integer labels like the ones we just assigned, `[]` looks a value up by its label rather than by its position.

For example: Which popular movie is associated with the label `103`?

In [None]:
# Access the element at label 103




### Accessing Data - `.loc` Accessor

Another way to access Series data by its label is using the `.loc` attribute. For example:


In [None]:
# Access the element at label 104




### Accessing Data - `.iloc` Accessor

Even with custom labels, you can still access elements by their numerical position (like in a list) using the `.iloc` attribute. 

For example:

In [None]:
# Access the element at position 0



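
The three access styles covered so far can be compared side by side. This is a sketch that rebuilds the movie Series with labels 100 through 104; the variable names are illustrative.

```python
import pandas as pd

movies_series = pd.Series(
    ["Oppenheimer", "Barbie", "The Godfather", "Parasite", "Spirited Away"],
    index=range(100, 105),
)

# Index operator: with integer labels, this is a label lookup
by_bracket = movies_series[103]

# .loc: always label-based
by_loc = movies_series.loc[104]

# .iloc: always position-based, counting from 0
by_iloc = movies_series.iloc[0]

print(by_bracket, by_loc, by_iloc, sep=" | ")
```

Using `.loc` and `.iloc` explicitly makes it obvious to a reader whether a number is meant as a label or as a position.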

### Slicing Series Data

We can select multiple elements from a Series using slicing with the index operator `[]`, `.loc` (label-based slicing), and `.iloc` (position-based slicing). 

Here's how:

In [None]:
# Demonstrate [] slicing


# Demonstrate .loc slicing


# Demonstrate .iloc slicing


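
A short sketch of label-based versus position-based slicing on the same movie Series. Plain `[]` slicing is left out here because whether it goes by label or by position depends on what you slice with, which the explicit accessors avoid.

```python
import pandas as pd

movies_series = pd.Series(
    ["Oppenheimer", "Barbie", "The Godfather", "Parasite", "Spirited Away"],
    index=range(100, 105),
)

# .loc slices by label and includes BOTH endpoints (101, 102, 103)
loc_slice = movies_series.loc[101:103]

# .iloc slices by position and excludes the stop position (positions 1 and 2)
iloc_slice = movies_series.iloc[1:3]

print(loc_slice)
print(iloc_slice)
```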

## DataFrame

<center><img src="../images/stock/pexels-suki-lee-110686949-16200703.jpg" width="500"></center>

A pandas DataFrame is like a table with rows and columns. It's a 2D structure where each column can hold different types of data. Think of it as a collection of Series, all sharing the same row labels. Each column is essentially a Series.

### Creating a DataFrame

##### `pd.DataFrame()`

We use the `pd.DataFrame()` function to create a pandas DataFrame.

In [None]:
# Synthetic data
data = {
    'Name': ['TechGuru', 'FashionDiva', 'GameMaster', 'FoodieFun', 'TravelBug', 'MusicMania', 'BeautyQueen', 'DIYExpert', 'SportsFan', 'ComedyKing'],
    'Subscribers': [1500000, 2300000, 1800000, 1200000, 950000, 2700000, 1100000, 1600000, 2000000, 1400000],
    'Views': [120000000, 250000000, 180000000, 90000000, 60000000, 300000000, 80000000, 140000000, 220000000, 100000000],
    'Category': ['Tech', 'Fashion', 'Gaming', 'Food', 'Travel', 'Music', 'Beauty', 'DIY', 'Sports', 'Comedy'],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Australia', 'USA', 'USA', 'Canada', 'USA', 'UK'],
    'DateStarted': ['2021-01-01', '2020-05-15', '2019-11-01', '2022-03-10', '2018-09-20', '2023-02-01', '2021-07-01', '2020-10-01', '2019-04-01', '2022-01-01']
}
}

# Create the DataFrame





__Note:__

* Jupyter has a neat feature where if the last thing in a cell is a DataFrame, it'll display as an HTML table without needing `print()`, which gives you a cleaner look than the standard text output.
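
As a minimal sketch, here is the same idea with only the first three channels and three of the columns. The names `channels` and `channels_df` are chosen for this example.

```python
import pandas as pd

# A reduced version of the synthetic data: each key becomes a column
channels = {
    "Name": ["TechGuru", "FashionDiva", "GameMaster"],
    "Subscribers": [1500000, 2300000, 1800000],
    "Category": ["Tech", "Fashion", "Gaming"],
}

# Build the DataFrame from the dictionary
channels_df = pd.DataFrame(channels)

print(channels_df)
print(channels_df.shape)  # (number of rows, number of columns)
```

Every list in the dictionary needs the same length, since each one supplies a full column.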

### Initial DataFrame Inspection

Pandas makes it easy to get a quick understanding of your DataFrame. We'll introduce four fundamental methods for this:

* `df.head()`: to see the beginning of your data.
* `df.tail()`: to see the end of your data.
* `df.describe()`: for a statistical summary of numerical columns.
* `df.info()`: to check data types and non-null values.


#### `df.head()`

The `df.head()` method returns the first five rows of the DataFrame by default. Pass a number, such as `df.head(10)`, to change how many rows you get.

In [None]:
# Demonstration





#### `df.tail()`

The `df.tail()` method returns the last five rows of the DataFrame by default. Pass a number, such as `df.tail(10)`, to change how many rows you get.

In [None]:
# Demonstration





#### `df.describe()`

The `df.describe()` method is a powerful tool for quickly understanding the distribution of your numerical data within a DataFrame. When called, it computes and summarizes several key statistical measures for each numerical column:

* __count__: The number of non-missing (non-NaN) values.
* __mean__: The average value.
* __std__: The standard deviation, a measure of the spread or dispersion of the data.
* __min__: The minimum value.
* __max__: The maximum value.
* __25% (Q1)__: The first quartile, meaning 25% of the data falls below this value.
* __50% (Median or Q2)__: The middle value; 50% of the data is below and 50% is above.
* __75% (Q3)__: The third quartile, meaning 75% of the data falls below this value.

This output provides a concise overview of the central tendency, dispersion, and shape of the numerical data in your DataFrame.
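
The inspection methods can be tried together on a small frame. The `scores` data below is invented for this sketch and is not part of the YouTuber example.

```python
import pandas as pd

# Hypothetical data: eight players and their scores
scores = pd.DataFrame({
    "player": list("ABCDEFGH"),
    "score": [10, 20, 30, 40, 50, 60, 70, 80],
})

first_rows = scores.head()    # first five rows by default
last_three = scores.tail(3)   # last three rows
summary = scores.describe()   # statistics for the numerical column only

print(first_rows)
print(last_three)
print(summary)

# info() prints column names, dtypes, and non-null counts
scores.info()
```

Note that `describe()` skips the text column `player` here, because by default it summarizes numerical columns.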

In [None]:
# Demonstration





### DataFrame Restructuring
<center><img src="../images/stock/pexels-lgorincioi-8457645.jpg" width="500"></center>

This section focuses on how to modify the structure of your Pandas DataFrame, including:

* __Index Manipulation__: Changing or resetting the row labels.
* __Column Ordering__: Arranging columns as needed.
* __Column Removal__: Dropping specific columns from the DataFrame.

#### Index Column

By default, DataFrames have a numerical index starting from zero, just like Series.

However, you can set one or more of your existing columns as the DataFrame's index.

For example, let's use the `DateStarted` column as the new index:

In [None]:
# Set 'DateStarted' as the index





After setting `DateStarted` as the index in our YouTuber example, the row labels are the values from that column. They are still stored as text (strings) until you convert them with a function such as `pd.to_datetime()`.

Remember, index labels in pandas are not limited to integers and strings. Dates and tuples also work; the main requirement is that each label is hashable, so mutable objects such as lists cannot be used as labels.
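
A minimal sketch of setting a column as the index, using two rows from the YouTuber data. Converting the column with `pd.to_datetime()` first is an optional extra step added here so the labels become real dates.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["TechGuru", "FashionDiva"],
    "DateStarted": ["2021-01-01", "2020-05-15"],
})

# Optional: turn the date strings into datetime values
df["DateStarted"] = pd.to_datetime(df["DateStarted"])

# set_index() returns a new DataFrame; the original df is left unchanged
indexed = df.set_index("DateStarted")

print(indexed)
print(type(indexed.index).__name__)
```

To go back to the default numeric labels, `indexed.reset_index()` moves the index into a regular column again.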

#### Specifying Column Order

By providing an ordered sequence of column names, you can control the order in which the columns appear in the resulting DataFrame.

To specify the order of columns when creating a DataFrame (e.g., from a dictionary or list of lists), you pass a list of the desired column names to the `columns` parameter:

```python
pd.DataFrame(data, columns=[
    'column_name1', 
    'column_name2', 
    'column_name3'
])
```

Similarly, to reorder existing columns, you can reassign the DataFrame with the desired column order:

```python
df = df[[
    'column_name2', 
    'column_name1', 
    'column_name3'
]]
```

In [None]:
# Demonstration




### More DataFrame Creation Techniques

Beyond dictionaries, you can create Pandas DataFrames in several other ways:

* __From Series__: Combining one or more Series.
* __From Files__: Reading CSV, Excel, and other formats.
* __From APIs__: Retrieving data from web sources.

#### Combining Series into a DataFrame

You can create a DataFrame by putting multiple pandas Series together. Each Series will become a column in the resulting DataFrame.

In [None]:
# Synthetic Produce Data
produce_names = pd.Series(['Apple', 'Banana', 'Carrot', 'Date', 'Eggplant'])
quantities = pd.Series(list(range(50, 251, 50)))
prices = pd.Series([round(random.uniform(0.1, 3.0), 2) for _ in range(5)])

# Create Produce DataFrame




# Output Data Frame





__How it Works__

In that example:

* We used `pd.DataFrame()` to build the DataFrame.
* We provided a Python dictionary.
    * Keys: The dictionary keys became the labels for each column in the DataFrame.
    * Values: The dictionary values were the pandas Series, and these Series provided the actual data for their respective columns.
* Essentially, each Series in the dictionary transformed into a column in the DataFrame.
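
The steps above can be sketched with two short Series. The column names `Produce` and `Quantity` are illustrative choices.

```python
import pandas as pd

names = pd.Series(["Apple", "Banana", "Carrot"])
quantities = pd.Series([50, 100, 150])

# Keys become column labels; each Series supplies that column's data
produce_df = pd.DataFrame({
    "Produce": names,
    "Quantity": quantities,
})

print(produce_df)
```

The Series are lined up by their index labels, so Series that share the same default labels (0, 1, 2) combine row by row.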

### Reading Data from Files
<center><img src="../images/stock/pexels-eva-bronzini-6068493.jpg" width="500"></center>

Pandas provides powerful and convenient functions to import data from various external file formats directly into DataFrames. This section will introduce you to these essential tools for bringing your data into the Pandas environment.

### Reading CSV Files with `pd.read_csv()`

The `pd.read_csv()` function is your primary tool in Pandas for reading data stored in Comma Separated Values (CSV) files. This function is incredibly versatile and can handle a wide variety of CSV file structures. Let's explore its basic usage.

#### Reading CSV files from the Web
Pandas makes it convenient to read CSV files not only from your local computer but also directly from web URLs. This is particularly useful when working with publicly available datasets.

Here is the general syntax:

```python
import pandas as pd

URL = "YOUR_CSV_FILE_URL_HERE"
df = pd.read_csv(URL)
```

Explanation:

1. We begin by importing the Pandas library as pd.
2. You will replace `"YOUR_CSV_FILE_URL_HERE"` with the specific web address of the CSV file you want to read. This URL is stored in a variable.
3. The `pd.read_csv(URL)` function is then called, which performs the following actions:
    1. __Fetches the data__: Pandas sends a request to the URL and retrieves the content of the CSV file.
    2. __Parses the data__: It interprets the comma-separated values and organizes them into a tabular structure.
    3. __Creates a DataFrame__: The parsed data is automatically loaded into a Pandas DataFrame, which we have named `df` in this example.
4. Once the data is in a DataFrame, you can use standard Pandas methods like `df.head()` to see the initial rows and `df.info()` to understand its structure (number of rows, columns, data types, and non-null values).

This method streamlines the process of working with online CSV datasets, allowing you to quickly load and begin analyzing data directly from the web.
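
To try the parsing step without a network connection, the sketch below feeds `pd.read_csv()` a small piece of CSV text through `io.StringIO`, which stands in for a file or URL. The `city` and `trips` columns are invented for this example.

```python
import io
import pandas as pd

# Inline CSV text standing in for the contents of a file or URL
csv_text = "city,trips\nPortland,120\nSalem,45\n"

# read_csv accepts file-like objects as well as paths and URLs
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
df.info()
```

The first line of the text is treated as the header row, and each later line becomes one row of the DataFrame.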

#### Example: Reading Nike BikeTown Data from a URL
<center><img src="../images/generated/gemini_generated_panda_bike.jpeg" width="400"></center>

Now, let's put this into practice with some publicly available data from Nike BikeTown. We can read a CSV file of trip data directly from the file's web address using the `pd.read_csv()` function.


Here's the URL for the BikeTown data:

```python
URL = "https://s3.amazonaws.com/biketown-tripdata-public/2018_05.csv"
```

For more BikeTown data, visit [BikeTown - System Data](https://biketownpdx.com/system-data).

##### Importing and Inspecting the Data

Given the URL, let's do the following:

1. Read the BikeTown data from the provided URL into a DataFrame called `biketown_may`
2. Use the `.head()` method to display the first 5 rows of the DataFrame. What information do you think each column represents?

In [None]:
# Demonstration





3. Use the `.info()` method to get a concise summary of the DataFrame, including data types and non-null values.

In [None]:
# Demonstration





4. Use the `.shape` attribute to find the number of rows and columns in the DataFrame.

In [None]:
# Demonstration





#### Selecting and Inspecting Columns

1. Select the `StartHub` column and display the first 10 unique values using the `.unique()` method. What are some of the starting bike station names?

2. Select the `Distance_Miles` column. What do you think the units of this column are? Calculate and display the minimum and maximum values of this column using `.min()` and `.max()`.
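
Before working on the real data, the same calls can be tried on a tiny stand-in frame. The `trips` rows below are invented; only the column names `StartHub` and `Distance_Miles` are taken from this notebook.

```python
import pandas as pd

# Hypothetical trips, used only to demonstrate the method calls
trips = pd.DataFrame({
    "StartHub": ["Hub A", "Hub B", "Hub A", "Hub C"],
    "Distance_Miles": [1.2, 0.5, 3.4, 2.0],
})

# Selecting one column with [] gives a Series
hubs = trips["StartHub"].unique()[:10]   # up to ten distinct values

shortest = trips["Distance_Miles"].min()
longest = trips["Distance_Miles"].max()

print(hubs)
print(shortest, longest)
```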

In [None]:
# Demonstration










#### Descriptive Statistics

* Use the `describe()` method on the `Distance_Miles` column to get summary statistics like mean, standard deviation, min, max, and quartiles. 

In [None]:
# Demonstration


#### Value Counts

* Find the top 5 most frequent starting stations using the `.value_counts()` method on the `StartHub` column.
* Find the number of unique `BikeID` values in the DataFrame using `.nunique()`. This tells you how many individual bikes were used in May 2018.
* Examine the `PaymentPlan` column using `.value_counts()`. What are the different payment plans and their counts?
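
A small sketch of how `.value_counts()` and `.nunique()` behave, again on invented rows that only borrow the `StartHub` and `BikeID` column names:

```python
import pandas as pd

# Hypothetical trips, used only to demonstrate the method calls
trips = pd.DataFrame({
    "StartHub": ["Hub A", "Hub B", "Hub A", "Hub C", "Hub A", "Hub B"],
    "BikeID": [7, 8, 7, 9, 10, 8],
})

# value_counts() sorts from most to least frequent; head(5) keeps the top five
top_hubs = trips["StartHub"].value_counts().head(5)

# nunique() counts the distinct values in the column
n_bikes = trips["BikeID"].nunique()

print(top_hubs)
print(n_bikes)
```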

In [None]:
# Demonstration










### Beyond CSV: Other File Reading Methods

Pandas also provides functions for reading other common file types:

* __pd.read_excel()__: For reading data from Excel files (.xlsx, .xls).
* __pd.read_json()__: For reading data from JSON (JavaScript Object Notation) files.
* __pd.read_table()__: For reading delimited text files (similar to CSV but with customizable delimiters).
* __pd.read_parquet()__: For reading data from Parquet files, a columnar storage format.
* __pd.read_pickle()__: For reading serialized Python objects.
* __pd.read_sql()__: For reading data from SQL databases. 
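
Two of these readers can be tried on in-memory text, in the same spirit as `pd.read_csv()`. This sketch uses made-up produce data and `io.StringIO` in place of real files.

```python
import io
import pandas as pd

# Tab-separated text: read_table uses a tab as its default delimiter
tsv_text = "name\tqty\nApple\t50\nBanana\t100\n"
from_tsv = pd.read_table(io.StringIO(tsv_text))

# JSON text holding a list of records (one object per row)
json_text = '[{"name": "Apple", "qty": 50}, {"name": "Banana", "qty": 100}]'
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_tsv)
print(from_json)
```

Some readers rely on optional packages (for example, an Excel engine for `pd.read_excel()`), so they may need an extra install before they work.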

### Reading Data from an API

Want to know how Tesla's been performing? Let's grab their stock data from the past year using the Yahoo Finance API and the `yfinance` library. This will give us a DataFrame to analyze.

First, we need to install the library:

```bash
!pip install yfinance
```

In [None]:
# Install yfinance 


Next, let's import the necessary yfinance library:

```python
import yfinance as yf
```

In [None]:
# Import yfinance


#### Import and Inspect the Data

Now, we'll use the `yfinance.download()` function to fetch one year of Tesla's stock data.

The basic format is:

```python
yf.download(tickers, period=period)
```

For Tesla, the ticker symbol is `TSLA`, and we want one year of data (`1y`).

For more details on the yfinance API, you can check out the official documentation: [The yfinance API Reference](https://yfinance-python.org/reference/)

* Let's get the data with the `yf.download()` function, store it in a DataFrame called `tsla_data`, and inspect it with `.head()`.

In [None]:
# Demonstration





#### Examining Key Information Within the DataFrame

* From the `tsla_data` DataFrame, show the first ten values in the `Close` column. 

* Determine and display both the lowest and the highest closing prices recorded in the `tsla_data` DataFrame over the past year.

* Using the `.idxmin()` method, find the date on which the lowest closing price occurred. Similarly, using the `.idxmax()` method, find the date on which the highest closing price occurred.
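
To see how `.idxmin()` and `.idxmax()` relate to `.min()` and `.max()`, here is a sketch on a small date-indexed Series. The prices are invented placeholders, not real Tesla data, and no download is involved.

```python
import pandas as pd

# Hypothetical closing prices indexed by date (NOT real market data)
dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"])
close = pd.Series([248.4, 238.5, 237.9, 251.1], index=dates, name="Close")

# min()/max() return the values themselves
lowest = close.min()
highest = close.max()

# idxmin()/idxmax() return the index labels (here, dates) where they occur
low_date = close.idxmin()
high_date = close.idxmax()

print(lowest, low_date.date())
print(highest, high_date.date())
```

With real data, the same calls would be applied to the `Close` column of the downloaded DataFrame.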

In [None]:
# Demonstration
