# <center> Introduction to Pandas

> Pandas is a Python library used for working with data sets.

It is used for:
- analyzing
- cleaning
- exploring
- manipulating data.

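The name "Pandas" derives from "panel data", an econometrics term for multidimensional data sets, and is also a play on "Python data analysis".

It was created by **Wes McKinney** in 2008.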

Pandas is built on top of the **NumPy** package, meaning a lot of the structure of NumPy is used or replicated in Pandas.

### What is NumPy???

![question](resources/questions_dog.gif)

**In short:**

> NumPy is a package built to make numeric computations in Python faster.

> It introduces a new data structure: the array (`ndarray`).

> A NumPy array is a grid of values, all of the same type, indexed by a tuple of non-negative integers.

> An array is similar to a Python list, but has several advantages:

- **Size** - NumPy arrays take up less space than lists holding the same values
- **Performance** - Operations on NumPy arrays are faster than the equivalent loops over lists
- **Functionality** - NumPy has optimized functions, such as linear algebra operations, built in.
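
The points above can be illustrated with a minimal sketch. The variable names (`values_list`, `values_array`, `grid`) are chosen here for illustration only and do not appear elsewhere in this notebook.

```python
import numpy as np

# A plain Python list and the equivalent NumPy array (illustrative values)
values_list = [1, 2, 3, 4]
values_array = np.array(values_list)

# Every element of an ndarray shares a single dtype
print(values_array.dtype)

# Arithmetic applies element-wise to the whole array, without an explicit loop
doubled_array = values_array * 2
# The list version needs a loop or a comprehension
doubled_list = [v * 2 for v in values_list]

# A 2-D array is indexed by a tuple of non-negative integers: (row, column)
grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid.shape)   # (2, 3)
print(grid[1, 2])   # 6
```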

### Back to pandas

Pandas introduces 2 main data structures:
    
> pandas.Series

> pandas.DataFrame

![question](resources/series_dataframes.png)

**pandas.Series:**
    
A Pandas Series is a one-dimensional array holding data of any type.

It is like a column in a table.

**pandas.DataFrame:**

A Pandas DataFrame is a two-dimensional data structure.

It is like a table with rows and columns.
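
As a rough sketch of how the two structures relate, each column of a DataFrame is itself a Series. The names `ages` and `people` are illustrative and not part of the rest of this notebook.

```python
import pandas as pd

# A Series: one-dimensional, labelled values (like a single column)
ages = pd.Series([50, 40, 45], name="age")

# A DataFrame: a table whose columns are Series sharing one index
people = pd.DataFrame({"name": ["Jane", "Bill", "John"], "age": [50, 40, 45]})

# Selecting one column of the DataFrame gives back a Series
age_column = people["age"]
print(isinstance(age_column, pd.Series))  # True
print(age_column.equals(ages))            # True
```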



# Pandas DataFrame Hands-On

## How to create a DataFrame

### 1. From a Dict

In [1]:
# We need to import pandas once
import pandas as pd

In [2]:
data = {
  "name": ['Jane', 'Bill', 'John'],
  "age": [50, 40, 45]
}

# load data into a DataFrame
df = pd.DataFrame(data)

df

Unnamed: 0,name,age
0,Jane,50
1,Bill,40
2,John,45


### 2. From a List

In [3]:
data_list = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16],
             [17, 18, 19, 20]]

df = pd.DataFrame(data_list, columns=["col1", "col2", "col3", "col4"])
df

Unnamed: 0,col1,col2,col3,col4
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12
3,13,14,15,16
4,17,18,19,20


### 3. From a CSV file

In [4]:
df = pd.read_csv('./resources/data.csv')

df

Unnamed: 0,name,age
0,Jane,32
1,Bob,43
2,Maggy,22
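
The file `./resources/data.csv` is not shown in this notebook. The sketch below assumes its contents match the output above and reads the same text from an in-memory buffer, so it runs without the file. `csv_text` and `df_csv` are illustrative names.

```python
import io

import pandas as pd

# Assumed file contents, reconstructed from the output shown above
csv_text = "name,age\nJane,32\nBob,43\nMaggy,22\n"

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```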


### 4. From a JSON file

In [5]:
df = pd.read_json('./resources/data.json')

df

Unnamed: 0,name,age
0,Tom,64
1,Paul,43
2,James,30
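
The layout of `./resources/data.json` is not shown here. As one possible layout, the sketch below assumes a list of records (one JSON object per row) and passes `orient="records"` explicitly; the real file may be organised differently. `json_text` and `df_json` are illustrative names.

```python
import io

import pandas as pd

# Assumed layout: a JSON list with one object per row
json_text = (
    '[{"name": "Tom", "age": 64},'
    ' {"name": "Paul", "age": 43},'
    ' {"name": "James", "age": 30}]'
)

# read_json accepts a path or a file-like object
df_json = pd.read_json(io.StringIO(json_text), orient="records")
print(df_json)
```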


### 5. From a SQL table

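```python
# connect_db is a placeholder for your database driver's connect function;
# url, user and password stand for your own connection settings.
db_connection = connect_db(url, user, password)

# "table" stands for the name of the table to read
df = pd.read_sql_query("SELECT * FROM table", db_connection)
```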
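
For a version that runs without a database server, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `users` and its rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Jane", 50), ("Bill", 40), ("John", 45)],
)
conn.commit()

# Run the query and load the result set into a DataFrame
df_sql = pd.read_sql_query("SELECT * FROM users", conn)
conn.close()
print(df_sql)
```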

## Get information about a DataFrame

In [6]:
data = {
  "name": ['Jane', 'Bill', 'John'],
  "age": [50, 40, 45]
}

# load data into a DataFrame
df = pd.DataFrame(data)

### DataFrame columns

In [7]:
df.columns

Index(['name', 'age'], dtype='object')

### The data types of the DataFrame's columns

All elements of a column share the same data type.


In [8]:
df.dtypes

name    object
age      int64
dtype: object
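
To illustrate the one-dtype-per-column rule, the sketch below shows what happens when a column mixes Python types, and how a column can be converted with `astype`. The frame `mixed` and the series `ages` are illustrative only.

```python
import pandas as pd

# A column mixing an int, a string and a float falls back to the
# general-purpose object dtype
mixed = pd.DataFrame({"value": [1, "two", 3.0]})
print(mixed.dtypes)

# A homogeneous column can be converted to another dtype with astype
ages = pd.Series([50, 40, 45], name="age")
ages_as_float = ages.astype("float64")
print(ages_as_float.dtype)  # float64
```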

### Get only one column of the DataFrame

The output is a Series of a given data type.


In [9]:
name_col = df.name
name_col

0    Jane
1    Bill
2    John
Name: name, dtype: object

In [10]:
# The type of name_col is pd.Series
print(type(name_col))

# The dtype of the elements of name_col is object (here, Python strings)
print(name_col.dtype)

<class 'pandas.core.series.Series'>
object
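
Attribute access (`df.name`) is a shortcut. Bracket access (`df["name"]`) returns the same Series and also works when a column name clashes with a DataFrame method. A small sketch, where the `counts` frame is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jane", "Bill", "John"], "age": [50, 40, 45]})

# Both forms return the same Series
same_column = df.name.equals(df["name"])
print(same_column)  # True

# A list of column names returns a DataFrame instead of a Series
subset = df[["name", "age"]]

# Hypothetical frame with a column called "count": attribute access
# resolves to the DataFrame.count method, not to the column
counts = pd.DataFrame({"count": [1, 2, 3]})
print(callable(counts.count))      # True, this is the method
print(counts["count"].tolist())    # [1, 2, 3], this is the column
```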


### Get meta information about the DataFrame

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      object
 1   age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes


### Get the number of rows and columns

In [31]:
df.shape

(3, 2)

In [32]:
number_of_rows = df.shape[0]
number_of_cols = df.shape[1]

### Get statistics about the DataFrame

In [12]:
df.describe()  # Computes the statistics for numeric columns only

Unnamed: 0,age
count,3.0
mean,45.0
std,5.0
min,40.0
25%,42.5
50%,45.0
75%,47.5
max,50.0


In [13]:
# We can also compute specific statistics for a given column
print(df.age.mean())
print(df.age.count())
print(df.age.std())
print(df.age.min())
print(df.age.quantile(0.25)) 
print(df.age.quantile(0.5))
print(df.age.quantile(0.75))
print(df.age.max())

45.0
3
5.0
40
42.5
45.0
47.5
50
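
`describe()` can also be called on a single text column, where it reports the count, the number of unique values, the most frequent value and its frequency. The sketch below uses a small made-up sample (`people`) with one repeated name.

```python
import pandas as pd

# Illustrative sample with a repeated name (not data from this notebook's files)
people = pd.DataFrame({
    "name": ["Jane", "Bill", "John", "Jane"],
    "age": [50, 40, 40, 30],
})

# On a text column, describe() returns count, unique, top and freq
name_stats = people["name"].describe()
print(name_stats)
```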


### Display first or last rows

In [14]:
df.head(2)   # Displays the first n rows, default is 5

Unnamed: 0,name,age
0,Jane,50
1,Bill,40


In [15]:
df.tail(2)   # Displays the last n rows, default is 5

Unnamed: 0,name,age
1,Bill,40
2,John,45


### Get a column's unique values

In [16]:
data_dup = {
  "name": ['Jane', 'Bill', 'John', 'Jane'],
  "age": [50, 40, 40, 30]
}

# load data into a DataFrame
df_with_duplicates = pd.DataFrame(data_dup)

df_with_duplicates

Unnamed: 0,name,age
0,Jane,50
1,Bill,40
2,John,40
3,Jane,30


In [17]:
df_with_duplicates.age.unique()

array([50, 40, 30])

In [18]:
df_with_duplicates.name.unique()

array(['Jane', 'Bill', 'John'], dtype=object)
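
Beyond listing unique values, it is often useful to count them or to drop the duplicated rows. A short sketch on the same data, using `nunique`, `value_counts` and `drop_duplicates`:

```python
import pandas as pd

df_with_duplicates = pd.DataFrame({
    "name": ["Jane", "Bill", "John", "Jane"],
    "age": [50, 40, 40, 30],
})

# Number of distinct names
n_names = df_with_duplicates["name"].nunique()
print(n_names)  # 3

# How many times each name appears
name_counts = df_with_duplicates["name"].value_counts()
print(name_counts["Jane"])  # 2

# Keep only the first row for each name
deduplicated = df_with_duplicates.drop_duplicates(subset="name")
print(deduplicated)
```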

### Sort the DataFrame by a column's value

In [21]:
df.sort_values(by="age")

Unnamed: 0,name,age
1,Bill,40
2,John,45
0,Jane,50


In [22]:
df.sort_values(by="name")

Unnamed: 0,name,age
1,Bill,40
0,Jane,50
2,John,45
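
`sort_values` returns a new DataFrame and leaves the original untouched. The sketch below also shows descending order with `ascending=False` and renumbering the rows with `reset_index`.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Jane", "Bill", "John"], "age": [50, 40, 45]})

# Sort from oldest to youngest
sorted_desc = df.sort_values(by="age", ascending=False)
print(sorted_desc["name"].tolist())  # ['Jane', 'John', 'Bill']

# The original DataFrame keeps its order
print(df["name"].tolist())           # ['Jane', 'Bill', 'John']

# Renumber the rows 0..n-1 after sorting
renumbered = sorted_desc.reset_index(drop=True)
print(renumbered.index.tolist())     # [0, 1, 2]
```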


### Transpose the DataFrame

In [26]:
df.T

Unnamed: 0,0,1,2
name,Jane,Bill,John
age,50,40,45


## Your turn to play

![your_turn](resources/your_turn.gif)

### Exercise:

- Create a DataFrame from the file stored in **./resources/users.csv**
- Display the first 5 rows
- Show the data type of each column
- Show the size (number of rows)
- Give some statistics about the numeric columns
- Explore...

In [4]:
import pandas as pd
df = pd.read_csv('./resources/users.csv')
df.head()

Unnamed: 0,user_uuid,first_name,birthday,city,country,is_new_user
0,27859,Élodie Le Roux,1968-03-10,Berger-les-Bains,Émirats arabes unis,True
1,31111,Margot Deschamps,1983-09-25,Clément-sur-Mer,Géorgie,False
2,25356,Bernard Lopes,1991-09-07,Danielboeuf,Trinité et Tobago,True
3,34248,Dominique Vaillant de Delannoy,1986-12-02,Saint Dominique,Slovaquie,True
4,25123,Pierre Paul,1938-12-13,Sainte AdrienneBourg,Israël,False
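
The remaining steps can be sketched as follows. Since `users.csv` is not available here, the example rebuilds a small `sample` frame from the first three rows displayed above; on the real file you would call the same methods on `df`.

```python
import pandas as pd

# Sample rebuilt from the first three rows shown above (the real file has more)
sample = pd.DataFrame({
    "user_uuid": [27859, 31111, 25356],
    "first_name": ["Élodie Le Roux", "Margot Deschamps", "Bernard Lopes"],
    "birthday": ["1968-03-10", "1983-09-25", "1991-09-07"],
    "city": ["Berger-les-Bains", "Clément-sur-Mer", "Danielboeuf"],
    "country": ["Émirats arabes unis", "Géorgie", "Trinité et Tobago"],
    "is_new_user": [True, False, True],
})

# Data type of each column
print(sample.dtypes)

# Size: number of rows
number_of_rows = sample.shape[0]
print(number_of_rows)

# Statistics about the numeric columns
numeric_stats = sample.describe()
print(numeric_stats)
```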
