# Python Pandas

**[Pandas](https://pandas.pydata.org/)** is an open-source, BSD-licensed Python library that provides fast, flexible, and expressive data structures designed to make working with 'relational' or 'labeled' data both easy and intuitive. It aims to be the fundamental high-level building block for practical, real-world data analysis in Python.

pandas is well suited for many different kinds of data:

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data with row and column labels
* Any other form of observational / statistical data sets.

# Python DataFrame

In this lesson, you will learn about the pandas DataFrame. It covers the basics of the DataFrame, its attributes and functions, and how to use a DataFrame for data analysis.

DataFrame is the most widely used data structure in Python pandas. You can imagine it as a table in a database or a spreadsheet.

Imagine you have an automobile showroom, and you want to analyze data about your cars to shape your business strategy. For example, you need to check how many sedans you have in your showroom, or which cars give good mileage. A pandas DataFrame is well suited to this kind of analysis.

## What is DataFrame in Pandas

A DataFrame is a tabular (rows and columns) representation of data. It is a two-dimensional data structure whose columns can hold different data types.

A DataFrame is size-mutable: columns and rows can be added to it or deleted from it. A Series, by contrast, is generally treated as having a fixed length.
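
As a small sketch of what size-mutable means in practice, the snippet below adds a column to a DataFrame and then drops one. The `cars` data and its column names are invented for illustration and do not come from a real dataset.

```python
import pandas as pd

# Hypothetical showroom data, invented for this illustration
cars = pd.DataFrame({"model": ["A1", "B2"], "type": ["sedan", "suv"]})

# Adding a column grows the frame from 2 to 3 columns
cars["mileage"] = [18.5, 14.2]

# drop() returns a new frame and leaves `cars` itself unchanged
without_type = cars.drop(columns="type")

print(cars.shape)          # (2, 3)
print(without_type.shape)  # (2, 2)
```

Note that `drop()` is not in-place by default, so the original frame keeps all three columns.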

<div>
<img src="img/dataframe.png" width="600"/>
</div>

## DataFrame creation

Data is available in various forms and types such as CSV, SQL tables, JSON, or Python structures like **`list`** and **`dict`**. We need to convert these different formats into a DataFrame so that we can analyze the data efficiently with pandas.

To create a DataFrame, we can use either the DataFrame constructor or pandas built-in functions. Below are some examples.
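
The two routes can be sketched side by side. This example assumes a tiny in-memory CSV string (so no file is needed) for the built-in `pd.read_csv` function, and a hypothetical list of records for the constructor.

```python
import io
import pandas as pd

# Built-in reader: parse CSV text (a real file path would work the same way)
csv_text = "Name,Age,Marks\nJoe,20,85.1\nNat,21,77.8\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Constructor: build the frame from plain Python structures
records = [{"Name": "Joe", "Age": 20}, {"Name": "Nat", "Age": 21}]
from_records = pd.DataFrame(records)

print(from_csv)
print(from_records)
```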

### DataFrame constructor

```python
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
```

#### Parameters:

* **`data`**: Accepts a **`dict`**, **`list`**, **`ndarray`**, other ordered iterable, or another DataFrame. A **`set`** is rejected because it has no defined order. If no input is provided, an empty DataFrame is created. For **`dict`** input, the resulting column order follows the insertion order.


* **`index`**: (Optional) The row labels for the DataFrame. The default is a range of integers 0, 1, …, n-1, where n is the number of rows.


* **`columns`**: (Optional) The column labels for the DataFrame. If the data carries no column labels, the default is a range of integers 0, 1, …, n-1, where n is the number of columns.


* **`dtype`**: (Optional) By default, the data type of each column is inferred from the data. Passing this option forces a single data type on the whole DataFrame.


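* **`copy`**: (Optional) Boolean. Whether to copy data from the inputs. The signature above shows a default of `False`; recent pandas releases use `None`, which copies **`dict`** input but not DataFrame or 2D **`ndarray`** input.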
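
To make the parameters above concrete, here is a minimal sketch that passes `index`, `columns`, and `dtype` explicitly. The labels `r1`..`r3` and `x`, `y` are arbitrary names chosen for this example.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2], [3, 4], [5, 6]])

# Explicit row labels, column labels, and a single dtype for the whole frame
df_params = pd.DataFrame(
    data,
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
    dtype="float64",
)
print(df_params)

# With no labels given, rows and columns fall back to integer ranges
df_default = pd.DataFrame(data)
print(list(df_default.columns))  # [0, 1]

# With no data at all, the result is an empty DataFrame
df_empty = pd.DataFrame()
print(df_empty.empty)  # True
```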

### Dataframe from dict

When we have data in a **`dict`** or another built-in Python data structure, we can convert it into a DataFrame using the DataFrame constructor.

To construct a DataFrame from a **`dict`** object, pass it to the DataFrame constructor: **`pd.DataFrame(dict)`**. The **`dict`** keys become the column labels, and the **`dict`** values become the columns' data. We can also use the **`DataFrame.from_dict()`** function to **[Create DataFrame from dict](https://github.com/milaan9/10_Python_Pandas_Module/blob/main/001_Python_Pandas_Methods/001_Python_Pandas_DataFrame_from_Dictionary.ipynb)**.
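
As a rough sketch of `DataFrame.from_dict()`, the snippet below uses its `orient` parameter to decide whether the dict keys become column labels (the default) or row labels. The `rows` dict and its keys `s1`, `s2` are hypothetical.

```python
import pandas as pd

student_dict = {'Name': ['Joe', 'Nat'], 'Age': [20, 21], 'Marks': [85.10, 77.80]}

# Default orient='columns': keys become column labels, same as pd.DataFrame(dict)
by_columns = pd.DataFrame.from_dict(student_dict)

# orient='index': outer keys become row labels, inner keys become columns
rows = {'s1': {'Name': 'Joe', 'Age': 20}, 's2': {'Name': 'Nat', 'Age': 21}}
by_rows = pd.DataFrame.from_dict(rows, orient='index')

print(by_columns)
print(by_rows)
```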

**Key and Imports:**

| Name | Description |
|:----: |:---- |
| **`df`** | **pandas DataFrame object** |
| **`s`**  | **pandas Series object** |

In [2]:
# Example:

student_dict = {'Name':['Joe','Nat'], 'Age':[20,21], 'Marks':[85.10, 77.80]}
student_dict

{'Name': ['Joe', 'Nat'], 'Age': [20, 21], 'Marks': [85.1, 77.8]}

**'Name'**, **'Age'** and **'Marks'** are the keys in the **`dict`**. When you convert it, they become the column labels of the DataFrame.

In [3]:
import pandas as pd

# Python dict object
student_dict = {'Name': ['Joe', 'Nat'], 'Age': [20, 21], 'Marks': [85.10, 77.80]}
print(student_dict)

# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)

{'Name': ['Joe', 'Nat'], 'Age': [20, 21], 'Marks': [85.1, 77.8]}
  Name  Age  Marks
0  Joe   20   85.1
1  Nat   21   77.8


In [4]:
# Example

import pandas as pd

# We pass a dict of {column name: column values}
df = pd.DataFrame({'X': [78, 85, 96, 80, 86], 'Y': [84, 94, 89, 83, 86], 'Z': [86, 97, 96, 72, 83]})
print(df)

    X   Y   Z
0  78  84  86
1  85  94  97
2  96  89  96
3  80  83  72
4  86  86  83


In [5]:
# Example

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [True, True, False],
                   'C': [0.496714, -0.138264, 0.647689]},
                  index=['a', 'b', 'c'])  # custom row labels instead of the default 0, 1, 2
df

   A      B         C
a  1   True  0.496714
b  2   True -0.138264
c  3  False  0.647689


### Indexing

Our first improvement over NumPy arrays is labeled indexing. We can select subsets by column, row, or both. Column selection uses the regular Python **`__getitem__`** machinery. Pass in a single column label **`'A'`** or a list of labels **`['A', 'C']`** to select subsets of the original **`DataFrame`**.

In [11]:
# Single column, reduces to a Series
df['A']

a    1
b    2
c    3
Name: A, dtype: int64

In [12]:
type(df['A'])

pandas.core.series.Series

In [14]:
cols = ['A', 'C']
df[cols]

   A         C
a  1  0.496714
b  2 -0.138264
c  3  0.647689


In [15]:
type(df[cols])

pandas.core.frame.DataFrame

In [16]:
df.loc[['a', 'b']]

   A     B         C
a  1  True  0.496714
b  2  True -0.138264


In [None]:
df.loc['a':'b']

   A     B         C
a  1  True  0.496714
b  2  True -0.138264


In [17]:
df.iloc[[0, 1]]

   A     B         C
a  1  True  0.496714
b  2  True -0.138264


In [None]:
df.iloc[:2]

   A     B         C
a  1  True  0.496714
b  2  True -0.138264


In [18]:
df.loc['a', 'B']

True

In [19]:
df.loc['a':'b', ['A', 'C']]

   A         C
a  1  0.496714
b  2 -0.138264


#### Summary

- Use **`[]`** for selecting columns
- Use **`.loc[row_labels, column_labels]`** for label-based indexing
- Use **`.iloc[row_positions, column_positions]`** for position-based indexing
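
One detail worth seeing in code is how slices behave in the two indexers. The sketch below rebuilds the same `df` as earlier and selects the same block twice: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [True, True, False],
                   'C': [0.496714, -0.138264, 0.647689]},
                  index=['a', 'b', 'c'])

# Label slice: both endpoints 'a' and 'b' are included
by_label = df.loc['a':'b', ['A', 'C']]

# Position slice: position 2 is excluded, so rows 0 and 1 are returned
by_position = df.iloc[0:2, [0, 2]]

print(by_label.equals(by_position))  # True
```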

I've left out boolean and hierarchical indexing, which we'll see later.

## Series

You've already seen some **Series** up above. It's the 1-dimensional analog of the DataFrame. Each column in a **DataFrame** is in some sense a **Series**. You can select a **Series** from a DataFrame in a few ways:

In [None]:
# __getitem__ like before
df['A']

a    1
b    2
c    3
Name: A, dtype: int64

In [None]:
# .loc, like before
df.loc[:, 'A']

a    1
b    2
c    3
Name: A, dtype: int64

In [None]:
# using `.` attribute lookup
df.A

a    1
b    2
c    3
Name: A, dtype: int64

In [20]:
df['mean'] = ['a', 'b', 'c']
df

   A      B         C mean
a  1   True  0.496714    a
b  2   True -0.138264    b
c  3  False  0.647689    c


In [None]:
df.mean

<bound method NDFrame._add_numeric_operations.<locals>.mean of    A      B         C mean
a  1   True  0.496714    a
b  2   True -0.138264    b
c  3  False  0.647689    c>

In [None]:
df['mean']

a    a
b    b
c    c
Name: mean, dtype: object

In [None]:
# Create a Series:

import pandas as pd
s = pd.Series([2, 4, 6, 8, 10])
print(s)

0     2
1     4
2     6
3     8
4    10
dtype: int64


Be careful with attribute access (**`df.A`**). It won't work if your column name isn't a valid Python identifier (say it has a space) or if it conflicts with one of the (many) methods on **DataFrame**, as with **`mean`** above. The **`.`** accessor is extremely convenient for interactive use though.

You should never *assign* a column with **`.`**. For example, don't do this:
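
The two failure cases can be sketched with a small frame. The column names `unit price` and `count` are invented for this example: the first contains a space, and the second collides with the `DataFrame.count` method.

```python
import pandas as pd

df_cols = pd.DataFrame({'unit price': [9.5, 12.0], 'count': [3, 4]})

# 'unit price' is not a valid identifier, so only bracket access can reach it
prices = df_cols['unit price']

# 'count' is also a DataFrame method, so attribute lookup returns the method
print(callable(df_cols.count))  # True

# Bracket access still returns the column
counts = df_cols['count']
print(counts.tolist())  # [3, 4]
```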

```python
# bad
df.A = [1, 2, 3]
```

It's unclear whether you're attaching the list **`[1, 2, 3]`** as an attribute of **`df`**, or whether you want it as a column. It's better to write

```python
df['A'] = [1, 2, 3]
# or
df.loc[:, 'A'] = [1, 2, 3]
```
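
The ambiguity is easiest to see with a name that is not yet a column. In this sketch (the names `df_attr`, `D`, and `E` are arbitrary), attribute assignment stores a plain Python attribute and creates no column, while bracket assignment creates one. pandas emits a warning for the first pattern, which is silenced here only to keep the output clean.

```python
import warnings
import pandas as pd

df_attr = pd.DataFrame({'A': [1, 2, 3]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    df_attr.D = [4, 5, 6]            # sets an attribute, not a column

print('D' in df_attr.columns)  # False

df_attr['E'] = [4, 5, 6]       # bracket assignment creates a real column
print('E' in df_attr.columns)  # True
```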

**Series** share many of the same methods as **DataFrame**s.
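
As a brief illustration, `mean()` exists on both types. On a Series it returns a single value; on a DataFrame it returns a Series with one value per column. The numbers below are arbitrary.

```python
import pandas as pd

df_num = pd.DataFrame({'A': [1, 2, 3], 'C': [0.5, 1.5, 2.5]})

col_mean = df_num['A'].mean()  # scalar: mean of one column
frame_mean = df_num.mean()     # Series: one mean per column

print(col_mean)    # 2.0
print(frame_mean)
```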

## Index

An **`Index`** is something of a peculiarity of pandas.
First off, these are not the kind of indexes you'll find in SQL, which are used to help the engine speed up certain queries.
In pandas, an **`Index`** is about labels. This helps with selection (like we did above) and automatic alignment when performing operations between two **DataFrames** or **Series**.
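
Automatic alignment can be sketched with two Series whose labels only partly overlap. The values are arbitrary. Addition matches entries by label rather than by position, and labels present on only one side produce `NaN`.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'd'])

# Values are paired by label: 'b' with 'b', 'c' with 'c'
total = s1 + s2

print(total['b'])  # 22.0
print(total['c'])  # 13.0
print(total.isna().sum())  # 2 ('a' and 'd' have no partner)
```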

R does have row labels, but they're nowhere near as powerful (or complicated) as in pandas. You can access the index of a **DataFrame** or **Series** with the **`.index`** attribute.

In [21]:
df.index

Index(['a', 'b', 'c'], dtype='object')

In [None]:
df.columns

Index(['A', 'B', 'C', 'mean'], dtype='object')

There are special kinds of **`Index`** that you'll come across. Some of these are:

- **`MultiIndex`** for multidimensional (Hierarchical) labels
- **`DatetimeIndex`** for datetimes
- **`Float64Index`** for floats (removed in pandas 2.0, where a plain **`Index`** with a float dtype is used instead)
- **`CategoricalIndex`** for, you guessed it, **Categoricals**
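
A few of these index types can be constructed directly, as in the sketch below. The dates, tuples, and categories are made-up sample values.

```python
import pandas as pd

# DatetimeIndex: produced by pd.date_range
dates = pd.date_range('2024-01-01', periods=3, freq='D')
ts = pd.Series([1.0, 2.0, 3.0], index=dates)
print(type(ts.index).__name__)  # DatetimeIndex

# MultiIndex: hierarchical labels built from tuples
multi = pd.MultiIndex.from_tuples(
    [('x', 1), ('x', 2), ('y', 1)], names=['letter', 'number']
)
print(multi.nlevels)  # 2

# CategoricalIndex: labels drawn from a fixed set of categories
cat = pd.CategoricalIndex(['low', 'high', 'low'])
print(type(cat).__name__)  # CategoricalIndex
```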

* **How do you slice a DataFrame by row label?**
  - Use **`.loc[label]`**. For position-based selection use **`.iloc[integer]`**.
* **How do you select a column of a DataFrame?**
  - Standard **`__getitem__`**: **`df[column_name]`**
* **Is the Index a column in the DataFrame?**
  - No. It isn't included in any operations (**`mean`**, etc). It can be inserted as a regular column with **`df.reset_index()`**.
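
The last answer can be checked with a short sketch: `reset_index()` moves the row labels into a regular column (named `index` when the index has no name) and replaces them with the default integer range.

```python
import pandas as pd

df_idx = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

# The index labels take no part in the computation
print(df_idx['A'].mean())  # 2.0

# reset_index() returns a new frame with the old labels as a column
flat = df_idx.reset_index()
print(list(flat.columns))  # ['index', 'A']
```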