# pandas
<center><img src="../images/stock/pexels-introspectivedsgn-4065800.jpg" width="500"></center>

pandas offers powerful data structures and manipulation tools that simplify data cleaning and analysis in Python. 

It's frequently used alongside NumPy for numerical operations, SciPy and statsmodels for statistical analysis, and Matplotlib for data visualization. 

pandas adopts NumPy's array-oriented computing paradigm, prioritizing array functions and avoiding explicit loops for data processing.


## Pandas Installation

You can install pandas with the following command:

```bash
!pip install pandas
```

However, just like with NumPy, I took the liberty of installing Pandas on the Jupyter Hub.

## Importing Pandas

You'll need to import pandas into your notebook or script before using it.

The standard convention for importing pandas is as follows:

```python
import pandas as pd
```

Let's go ahead and import pandas into our notebook in the cell below:

In [2]:
# Import Pandas
import pandas as pd
import random


## pandas Data Structures
<center><img src="../images/stock/pexels-jeffrey-czum-254391-2346289.jpg" width="500"></center>


### Series

Think of a pandas Series as a single column of data with labels.

* __Default Labels__: By default, each item gets a number based on its position, just like in a Python list.
* __Custom Labels__: You can also give each item your own specific label (like a name or an ID). These labels can be numbers, text, or even combinations.
* __Data Types__: A Series can hold various types of data, but it's most efficient when all the items in a single Series are of the same type. This is important because, as we'll see later, Series become the columns in a DataFrame, and columns ideally have consistent data types.

#### Creating a Series from a List

We can easily create a pandas Series from a one-dimensional dataset, like a Python list, using the `pd.Series()` constructor.

For example, let's take this list and turn it into a Series:

In [None]:
# Data
popular_shows = [
    "Stranger Things",
    "The Mandalorian",
    "The Queen's Gambit",
    "Bridgerton",
    "Squid Game",
    "Succession",
    "Ted Lasso",
    "The Witcher",
    "Euphoria",
    "Ozark"
]

# Transform into a Series using pd.Series()


# Output the Series



* Passing a list to `pd.Series()` creates a Series with automatic numeric labels (0, 1, 2, ...).
* The `.dtype` attribute tells us the data type of the elements within the Series.
* Text data (strings) are typically represented as the `object` dtype by default.
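
As a rough illustration of the points above, here is a minimal sketch that builds a Series from a list. The `cities` list and the variable names are made up for this example and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
cities = ["Lisbon", "Nairobi", "Osaka"]

# Passing a plain list gives default labels 0, 1, 2
city_series = pd.Series(cities)

print(city_series)
print(city_series.index)  # RangeIndex(start=0, stop=3, step=1)
print(city_series.dtype)  # often object for text; some pandas versions report a string dtype
```

The printed index confirms that the labels were generated automatically from the positions of the items.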

#### Custom Labels

You can specify your own labels for the Series using the `index` argument of the `pd.Series()` constructor. For example:

In [None]:
# Data
popular_movies = [
    "Oppenheimer",
    "Barbie",
    "The Godfather",
    "Parasite",
    "Spirited Away"
]

# Transform into a Series with Custom Indices
indices = list(range(100,105))





Now, the Series uses the custom labels we provided instead of the default numbers.
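
A small sketch of the same idea, assuming a made-up list of planets and hypothetical string labels:

```python
import pandas as pd

# Hypothetical data and labels, chosen only for illustration
planets = ["Mercury", "Venus", "Earth"]
labels = ["p1", "p2", "p3"]

# The index argument replaces the default 0, 1, 2 labels
planet_series = pd.Series(planets, index=labels)

print(planet_series.index.tolist())  # ['p1', 'p2', 'p3']
print(planet_series["p2"])           # Venus
```

The number of labels has to match the number of values, otherwise pandas raises an error.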

#### Accessing Data in a Series

##### Index Operator `[]`
Similar to Python lists, we can retrieve data from a Series using square brackets `[]`. When a Series has custom labels, `[]` looks items up by label; position-based access is best left to `.iloc`, covered below.

For example: Which popular movie is associated with the label `103`?

In [None]:
# Access the element at label 103




##### `.loc` Accessor

Another way to access Series data by its label is using the `.loc` attribute. For example:


In [None]:
# Access the element at label 104




##### `.iloc` Accessor

Even with custom labels, you can still access elements by their numerical position (like in a list) using the `.iloc` attribute. 

For example:
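
The three access styles can be compared side by side in a short sketch. The `scores` Series and its integer labels are hypothetical and chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical Series with integer labels that differ from the positions
scores = pd.Series([88, 92, 79], index=[10, 20, 30])

print(scores[20])      # 92, [] looks up the label 20
print(scores.loc[30])  # 79, .loc is always label-based
print(scores.iloc[0])  # 88, .iloc is always position-based

# With integer labels, scores[0] would raise a KeyError because
# there is no label 0; scores.iloc[0] is the positional equivalent.
```

Using `.loc` and `.iloc` explicitly tends to make the intent clearer than relying on `[]`.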

In [None]:
# Access the element at position 0




##### Slicing Series Data

We can select multiple elements from a Series using slicing with the index operator `[]`, `.loc` (label-based slicing), and `.iloc` (position-based slicing). 

Here's how:
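
One detail worth seeing in code is that label-based slices include the end label, while position-based slices exclude the stop position. The sketch below uses a hypothetical Series with string labels to make that difference visible.

```python
import pandas as pd

# Hypothetical Series with string labels
letters = pd.Series(["a", "b", "c", "d", "e"], index=["v", "w", "x", "y", "z"])

by_brackets = letters["w":"y"]   # label slice with [], end label included
by_label = letters.loc["w":"y"]  # label slice with .loc, end label included
by_position = letters.iloc[1:3]  # position slice, stop position excluded

print(by_brackets.tolist())  # ['b', 'c', 'd']
print(by_label.tolist())     # ['b', 'c', 'd']
print(by_position.tolist())  # ['b', 'c']
```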

In [None]:
# Demonstrate [] slicing


# Demonstrate .loc slicing


# Demonstrate .iloc slicing



### DataFrame

<center><img src="../images/stock/pexels-suki-lee-110686949-16200703.jpg" width="500"></center>
A pandas DataFrame is like a table with rows and columns. It's a 2D structure where each column can hold different types of data. Think of it as a collection of Series, all sharing the same row labels. Each column is essentially a Series.

#### Creating a DataFrame

##### `pd.DataFrame()`

We use the `pd.DataFrame()` function to create a pandas DataFrame.
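
As a minimal sketch, a dictionary of equal-length lists can be passed straight to `pd.DataFrame()`. The `channel_data` dictionary below is a made-up, smaller stand-in for the exercise data.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each list the column values
channel_data = {
    "Name": ["ChannelA", "ChannelB", "ChannelC"],
    "Subscribers": [1200, 3400, 560],
}

channels = pd.DataFrame(channel_data)

print(channels.shape)             # (3, 2): three rows, two columns
print(channels.columns.tolist())  # ['Name', 'Subscribers']
```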

In [12]:
# Synthetic data
data = {
    'Name': ['TechGuru', 'FashionDiva', 'GameMaster', 'FoodieFun', 'TravelBug', 'MusicMania', 'BeautyQueen', 'DIYExpert', 'SportsFan', 'ComedyKing'],
    'Subscribers': [1500000, 2300000, 1800000, 1200000, 950000, 2700000, 1100000, 1600000, 2000000, 1400000],
    'Views': [120000000, 250000000, 180000000, 90000000, 60000000, 300000000, 80000000, 140000000, 220000000, 100000000],
    'Category': ['Tech', 'Fashion', 'Gaming', 'Food', 'Travel', 'Music', 'Beauty', 'DIY', 'Sports', 'Comedy'],
    'Country': ['USA', 'Canada', 'UK', 'USA', 'Australia', 'USA', 'USA', 'Canada', 'USA', 'UK'],
    'DateStarted': ['2021-01-01', '2020-05-15', '2019-11-01', '2022-03-10', '2018-09-20', '2023-02-01', '2021-07-01', '2020-10-01', '2019-04-01', '2022-01-01']
}

# Create the DataFrame





| | Name | Subscribers | Views | Category | Country | DateStarted |
|---|---|---|---|---|---|---|
| 0 | TechGuru | 1500000 | 120000000 | Tech | USA | 2021-01-01 |
| 1 | FashionDiva | 2300000 | 250000000 | Fashion | Canada | 2020-05-15 |
| 2 | GameMaster | 1800000 | 180000000 | Gaming | UK | 2019-11-01 |
| 3 | FoodieFun | 1200000 | 90000000 | Food | USA | 2022-03-10 |
| 4 | TravelBug | 950000 | 60000000 | Travel | Australia | 2018-09-20 |
| 5 | MusicMania | 2700000 | 300000000 | Music | USA | 2023-02-01 |
| 6 | BeautyQueen | 1100000 | 80000000 | Beauty | USA | 2021-07-01 |
| 7 | DIYExpert | 1600000 | 140000000 | DIY | Canada | 2020-10-01 |
| 8 | SportsFan | 2000000 | 220000000 | Sports | USA | 2019-04-01 |
| 9 | ComedyKing | 1400000 | 100000000 | Comedy | UK | 2022-01-01 |


__Note:__

* Jupyter has a neat feature where if the last thing in a cell is a DataFrame, it'll display as an HTML table without needing `print()`, which gives you a cleaner look than the standard text output.

##### `df.head()`

The `df.head()` method returns the first five rows of the DataFrame.

In [4]:
# Demonstration





| | Name | Subscribers | Views | Category | Country | DateStarted |
|---|---|---|---|---|---|---|
| 0 | TechGuru | 1500000 | 120000000 | Tech | USA | 2021-01-01 |
| 1 | FashionDiva | 2300000 | 250000000 | Fashion | Canada | 2020-05-15 |
| 2 | GameMaster | 1800000 | 180000000 | Gaming | UK | 2019-11-01 |
| 3 | FoodieFun | 1200000 | 90000000 | Food | USA | 2022-03-10 |
| 4 | TravelBug | 950000 | 60000000 | Travel | Australia | 2018-09-20 |


##### `df.tail()`

The `df.tail()` method returns the last five rows of the DataFrame.
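
Both `head()` and `tail()` also accept a row count if five is not what you want. A short sketch, using a hypothetical ten-row DataFrame named `numbers`:

```python
import pandas as pd

# Hypothetical ten-row DataFrame, used only for illustration
numbers = pd.DataFrame({
    "n": range(1, 11),
    "square": [i * i for i in range(1, 11)],
})

first_three = numbers.head(3)  # first 3 rows
last_two = numbers.tail(2)     # last 2 rows

print(first_three["n"].tolist())  # [1, 2, 3]
print(last_two["n"].tolist())     # [9, 10]
print(len(numbers.head()))        # 5, the default row count
```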

In [5]:
# Demonstration





| | Name | Subscribers | Views | Category | Country | DateStarted |
|---|---|---|---|---|---|---|
| 5 | MusicMania | 2700000 | 300000000 | Music | USA | 2023-02-01 |
| 6 | BeautyQueen | 1100000 | 80000000 | Beauty | USA | 2021-07-01 |
| 7 | DIYExpert | 1600000 | 140000000 | DIY | Canada | 2020-10-01 |
| 8 | SportsFan | 2000000 | 220000000 | Sports | USA | 2019-04-01 |
| 9 | ComedyKing | 1400000 | 100000000 | Comedy | UK | 2022-01-01 |


##### Specifying Column Order

By providing an ordered sequence of column names, you can control the order in which the columns appear in the resulting DataFrame.

To specify the order of columns when creating a DataFrame (e.g., from a dictionary or list of lists), you pass a list of the desired column names to the columns parameter:

```python
pd.DataFrame(data, columns=[
    'column_name1', 
    'column_name2', 
    'column_name3'
])
```

Similarly, to reorder existing columns, you can reassign the DataFrame with the desired column order:

```python
df = df[[
    'column_name2', 
    'column_name1', 
    'column_name3'
]]
```

In [9]:
# Demonstration





| | Name | Subscribers | Views | Category | Country | DateStarted |
|---|---|---|---|---|---|---|
| 0 | TechGuru | 1500000 | 120000000 | Tech | USA | 2021-01-01 |
| 1 | FashionDiva | 2300000 | 250000000 | Fashion | Canada | 2020-05-15 |
| 2 | GameMaster | 1800000 | 180000000 | Gaming | UK | 2019-11-01 |
| 3 | FoodieFun | 1200000 | 90000000 | Food | USA | 2022-03-10 |
| 4 | TravelBug | 950000 | 60000000 | Travel | Australia | 2018-09-20 |
| 5 | MusicMania | 2700000 | 300000000 | Music | USA | 2023-02-01 |
| 6 | BeautyQueen | 1100000 | 80000000 | Beauty | USA | 2021-07-01 |
| 7 | DIYExpert | 1600000 | 140000000 | DIY | Canada | 2020-10-01 |
| 8 | SportsFan | 2000000 | 220000000 | Sports | USA | 2019-04-01 |
| 9 | ComedyKing | 1400000 | 100000000 | Comedy | UK | 2022-01-01 |


##### `df.describe()`

The `df.describe()` method is a powerful tool for quickly understanding the distribution of your numerical data within a DataFrame. When called, it computes and summarizes several key statistical measures for each numerical column:

* __count__: The number of non-missing (non-NaN) values.
* __mean__: The average value.
* __std__: The standard deviation, a measure of the spread or dispersion of the data.
* __min__: The minimum value.
* __max__: The maximum value.
* __25% (Q1)__: The first quartile, meaning 25% of the data falls below this value.
* __50% (Median or Q2)__: The middle value; 50% of the data is below and 50% is above.
* __75% (Q3)__: The third quartile, meaning 75% of the data falls below this value.

This output provides a concise overview of the central tendency, dispersion, and shape of the numerical data in your DataFrame.
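
To make the summary concrete, here is a small sketch on a hypothetical `orders` DataFrame. The values are invented so that the mean is easy to check by hand.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns
orders = pd.DataFrame({
    "item": ["pen", "book", "lamp", "desk"],
    "price": [2.0, 12.0, 30.0, 156.0],
    "quantity": [10, 4, 2, 1],
})

summary = orders.describe()

print(summary.columns.tolist())       # ['price', 'quantity'], the text column is skipped
print(summary.loc["mean", "price"])   # 50.0, since (2 + 12 + 30 + 156) / 4 = 50
print(summary.loc["max", "quantity"]) # 10.0
```

The result is itself a DataFrame, so individual statistics can be pulled out with `.loc` as shown.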

In [14]:
# Demonstration




| | Subscribers | Views |
|---|---|---|
| count | 10.0 | 10.0 |
| mean | 1655000.0 | 154000000.0 |
| std | 552996.89 | 80443216.69 |
| min | 950000.0 | 60000000.0 |
| 25% | 1250000.0 | 92500000.0 |
| 50% | 1550000.0 | 130000000.0 |
| 75% | 1950000.0 | 210000000.0 |
| max | 2700000.0 | 300000000.0 |


##### Index Column

By default, DataFrames have a numerical index starting from zero, just like Series.

However, you can set one or more of your existing columns as the DataFrame's index.

For example, let's use the `DateStarted` column as the new index:

In [15]:
# Set 'DateStarted' as the index





<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         10 non-null     object
 1   Subscribers  10 non-null     int64 
 2   Views        10 non-null     int64 
 3   Category     10 non-null     object
 4   Country      10 non-null     object
 5   DateStarted  10 non-null     object
dtypes: int64(2), object(4)
memory usage: 612.0+ bytes


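After setting `DateStarted` as the index in our YouTuber example, the row labels are the values from that column. They are stored as strings unless you first convert the column with `pd.to_datetime()`.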

Remember, index labels in pandas are not limited to integers. Any hashable value can serve as a label: integers and strings are the most common, but dates and tuples work too. Unhashable objects such as lists cannot be used as individual labels.
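
A minimal sketch of setting a date column as the index, assuming a hypothetical `launches` DataFrame. Converting with `pd.to_datetime()` first is optional, but it turns the labels into real timestamps rather than strings.

```python
import pandas as pd

# Hypothetical data, used only for illustration
launches = pd.DataFrame({
    "Product": ["Alpha", "Beta", "Gamma"],
    "LaunchDate": ["2021-03-01", "2022-07-15", "2023-01-10"],
})

# Convert the date strings to timestamps before using them as labels
launches["LaunchDate"] = pd.to_datetime(launches["LaunchDate"])

# set_index returns a new DataFrame; launches itself keeps its 0, 1, 2 index
by_date = launches.set_index("LaunchDate")

print(by_date.index)
print(by_date.loc[pd.Timestamp("2022-07-15"), "Product"])  # Beta
```

Because `set_index()` returns a new DataFrame by default, the result needs to be assigned to a name if you want to keep it.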

#### Combining Series into a DataFrame

You can create a DataFrame by putting multiple pandas Series together. Each Series will become a column in the resulting DataFrame.

In [None]:
# Synthetic Produce Data
produce_names = pd.Series(['Apple', 'Banana', 'Carrot', 'Date', 'Eggplant'])
quantities = pd.Series(list(range(50, 251, 50)))
prices = pd.Series([round(random.uniform(0.1, 3.0), 2) for _ in range(5)])

# Create Produce DataFrame




# Output Data Frame





__How it Works__

In that example:

* We used `pd.DataFrame()` to build the DataFrame.
* We provided a Python dictionary.
    * Keys: The dictionary keys became the labels for each column in the DataFrame.
    * Values: The dictionary values were the pandas Series, and these Series provided the actual data for their respective columns.
* Essentially, each Series in the dictionary transformed into a column in the DataFrame.
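
The steps above can be sketched as follows. The Series and column names here are hypothetical and differ from the produce exercise on purpose.

```python
import pandas as pd

# Hypothetical Series, each sharing the default 0, 1, 2 labels
names = pd.Series(["Kettle", "Toaster", "Blender"])
stock = pd.Series([12, 7, 3])

# Dictionary keys become column names; each Series becomes a column
inventory = pd.DataFrame({"Item": names, "InStock": stock})

print(inventory.columns.tolist())  # ['Item', 'InStock']
print(inventory.loc[1, "Item"])    # Toaster
```

pandas lines the Series up by their index labels, so Series with mismatched labels would produce missing values in the combined DataFrame.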

#### Creating a DataFrame from an API

Want to know how Tesla's been performing? Let's grab their stock data from the past five years using the Yahoo Finance API and the `yfinance` library. This will give us a DataFrame to analyze.

First, we need to install the library:

```bash
!pip install yfinance
```

In [None]:
# Install yfinance 


Next, let's import the necessary yfinance library:

```python
import yfinance as yf
```

In [None]:
# Import yfinance


Now, we'll use the `yf.download()` function to fetch five years of Tesla's stock data.

The basic format is:

```python
yf.download(tickers, period=period)
```

For Tesla, the ticker symbol is `TSLA`, and we want data for the past five years (`5y`).

For more details on the yfinance API, you can check out the official documentation: [The yfinance API Reference](https://yfinance-python.org/reference/)

In [None]:
# Get Tesla Data



