# <center>Introduction to Pandas</center>

> Pandas is a Python library used for working with data sets.

It is used for:
- analyzing
- cleaning
- exploring
- manipulating data.

The name "Pandas" is derived from "panel data" and is also a play on "Python Data Analysis".

It was created by **Wes McKinney** in 2008.

Pandas is built on top of the **NumPy** package, meaning a lot of the structure of NumPy is used or replicated in Pandas.

### What is NumPy???

![question](resources/questions_dog.gif)

**In short:**

> NumPy is a package built to make numeric computations in Python faster.

> It introduces a new data structure: the N-dimensional array (`ndarray`), usually just called an array.

> A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers.

> An array is similar to a Python list, but has several advantages:

- **Size** - NumPy arrays of numbers take up less memory than equivalent lists
- **Performance** - Operations on NumPy arrays are faster than the same operations on lists
- **Functionality** - NumPy has optimized functions built in, such as linear algebra operations.
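
To make the comparison with lists concrete, here is a minimal sketch. The variable names are illustrative and not part of the original notebook. It applies the same element-wise operation to a list and to a NumPy array:

```python
import numpy as np

# A plain Python list needs an explicit loop (here, a comprehension)
prices_list = [10.0, 20.0, 30.0]
doubled_list = [p * 2 for p in prices_list]

# A NumPy array applies the operation to every element at once
prices_array = np.array(prices_list)
doubled_array = prices_array * 2

print(doubled_list)
print(doubled_array)
print(prices_array.dtype)  # every element shares this single type
```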

### Back to pandas

Pandas introduces 2 main data structures:
    
> pandas.Series

> pandas.DataFrame

![question](resources/series_dataframes.png)

**pandas.Series:**
    
A Pandas Series is a one-dimensional array holding data of any type.

It is like a column in a table.

**pandas.DataFrame:**

A Pandas DataFrame is a two-dimensional data structure.

It is like a table with rows and columns.
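
As a quick illustration of the one-dimensional case, the sketch below builds a Series by hand. The values and labels are made up for the example. Each value gets a label from the index, much like a row label in a table column:

```python
import pandas as pd

# A Series: one-dimensional values plus an index that labels each one
ages = pd.Series([50, 40, 45], index=["Jane", "Bill", "John"], name="age")

print(ages)
print(ages["Bill"])  # look up a value by its label
print(ages.shape)    # a single dimension
```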



# Pandas DataFrame Hands-On

## How to create a DataFrame

### 1. From a Dict

In [None]:
# We need to import pandas once
import pandas as pd

In [None]:
data = {
  "name": ['Jane', 'Bill', 'John'],
  "age": [50, 40, 45]
}

# load data into a DataFrame
df = pd.DataFrame(data)

df

### 2. From a List

In [None]:
data_list = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16],
             [17, 18, 19, 20]]

df = pd.DataFrame(data_list, columns=["col1", "col2", "col3", "col4"])
df

### 3. From a CSV file

In [None]:
df = pd.read_csv('./resources/data.csv')

df
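
If the file is not at hand, the same call can be tried on in-memory text. This is only a sketch: the CSV content below is hypothetical and does not reflect what `data.csv` actually contains.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = "name,age\nJane,50\nBill,40\nJohn,45\n"

# read_csv accepts a path or any file-like object; the first line becomes the header
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```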

### 4. From a JSON file

In [None]:
df = pd.read_json('./resources/data.json')

df

### 5. From a SQL table

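```python
# connect_db is a placeholder: use your database driver's connect function
# (for example sqlite3.connect) or a SQLAlchemy engine
db_connection = connect_db(url, user, password)

df = pd.read_sql_query("SELECT * FROM my_table", db_connection)
```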
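
For a version that runs without an external database, here is a minimal sketch using Python's built-in `sqlite3` module. The in-memory database, the `users` table, and its rows are assumptions made up for the example.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, used here only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("Jane", 50), ("Bill", 40), ("John", 45)],
)
conn.commit()

# The query result becomes a DataFrame, one column per selected field
df_sql = pd.read_sql_query("SELECT name, age FROM users ORDER BY age", conn)
conn.close()
print(df_sql)
```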

## Get information about a DataFrame

In [None]:
data = {
  "name": ['Jane', 'Bill', 'John'],
  "age": [50, 40, 45]
}

# load data into a DataFrame
df = pd.DataFrame(data)

### DataFrame columns

In [None]:
df.columns

### The data types of the DataFrame columns

All elements of a column have the same data type.


In [None]:
df.dtypes

### Get only one column of the DataFrame

The output is a Series of a given data type.


In [None]:
name_col = df.name
name_col

In [None]:
# The type of name_col is pd.Series
print(type(name_col))

# The dtype of its elements is usually object for strings
# (newer pandas versions may report a dedicated string dtype)
print(name_col.dtype)
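
Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. Bracket access works for any column label. The sketch below uses made-up column names to show the difference:

```python
import pandas as pd

df_cols = pd.DataFrame(
    {"name": ["Jane", "Bill"], "first name": ["J", "B"], "count": [1, 2]}
)

# Bracket access works for any column label, including ones with spaces
print(df_cols["first name"])
print(df_cols["count"])

# Attribute access finds the DataFrame.count method here, not the column
print(callable(df_cols.count))
```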

### Get meta information about the DataFrame

In [None]:
df.info()

### Get the number of rows and columns

In [None]:
df.shape

In [None]:
number_of_rows = df.shape[0]
number_of_cols = df.shape[1]

### Get statistics about the DataFrame

In [None]:
df.describe()  # By default, computes the stats for numeric columns only
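
To summarize non-numeric columns as well, `describe` accepts an `include` argument. A minimal sketch, rebuilding the same small table under an illustrative name:

```python
import pandas as pd

df_people = pd.DataFrame({"name": ["Jane", "Bill", "John"], "age": [50, 40, 45]})

# include="all" adds count, unique, top and freq for non-numeric columns
summary = df_people.describe(include="all")
print(summary)
```

Statistics that do not apply to a column (for example the mean of `name`) show up as `NaN`.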

In [None]:
# We can also compute some specific stats for a given column
print(df.age.mean())
print(df.age.count())
print(df.age.std())
print(df.age.min())
print(df.age.quantile(0.25)) 
print(df.age.quantile(0.5))
print(df.age.quantile(0.75))
print(df.age.max())

### Display first or last rows

In [None]:
df.head(2)   # Displays first n rows, default is 5

In [None]:
df.tail(2)   # Displays last n rows, default is 5

### Get a column's unique values

In [None]:
data_dup = {
  "name": ['Jane', 'Bill', 'John', 'Jane'],
  "age": [50, 40, 40, 30]
}

# load data into a DataFrame
df_with_duplicates = pd.DataFrame(data_dup)

df_with_duplicates

In [None]:
df_with_duplicates.age.unique()

In [None]:
df_with_duplicates.name.unique()
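
Two related methods are often useful next to `unique()`: `nunique()` counts the distinct values and `value_counts()` counts how often each one occurs. A short sketch on the same data, rebuilt under an illustrative name:

```python
import pandas as pd

df_dup = pd.DataFrame(
    {"name": ["Jane", "Bill", "John", "Jane"], "age": [50, 40, 40, 30]}
)

# Number of distinct values in the column
print(df_dup["name"].nunique())

# How many times each value appears, most frequent first
print(df_dup["name"].value_counts())
```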

### Sort the DataFrame by a column's value

In [None]:
df.sort_values(by="age")

In [None]:
df.sort_values(by="name")
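
Note that `sort_values` returns a new, sorted DataFrame and leaves the original unchanged. The sketch below, with illustrative variable names, also shows descending order through the `ascending` parameter:

```python
import pandas as pd

df_people = pd.DataFrame({"name": ["Jane", "Bill", "John"], "age": [50, 40, 45]})

# ascending=False sorts from largest to smallest
oldest_first = df_people.sort_values(by="age", ascending=False)

print(oldest_first)
print(df_people)  # the original row order is unchanged
```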

### Transpose the DataFrame

In [None]:
df.T

## Your turn to play

![your_turn](resources/your_turn.gif)

### Exercise:

- Create a DataFrame from the file stored in **./resources/users.csv**
- Display the first 5 rows
- Show the data type of each column
- Show the size (number of rows)
- Give some statistics about the numeric columns
- Explore...

In [None]:
import pandas as pd
df = pd.read_csv('./resources/users.csv')
df.head()