# <center> Introduction to Pandas

> Pandas is a Python library used for working with data sets.

It is used for:
- analyzing
- cleaning
- exploring
- manipulating data.

The name "Pandas" is derived from "panel data", and is also a play on "Python data analysis".

It was created by **Wes McKinney** in 2008.

Pandas is built on top of the **NumPy** package, meaning a lot of the structure of NumPy is used or replicated in Pandas.

### What is NumPy?

![question](resources/questions_dog.gif)

**In short:**

> NumPy is a package built to make numeric computations in Python faster.

> It introduces a new data structure: the N-dimensional array, `ndarray`.

> A NumPy array is a grid of values, all of the same type, indexed by a tuple of non-negative integers.

> An array is similar to a Python list, but offers several advantages:

- **Size** - NumPy arrays take up less memory than lists.
- **Performance** - Operations on NumPy arrays are faster than the same operations on lists.
- **Functionality** - NumPy has optimized functions built in, such as linear algebra operations.
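
As a rough illustration of the points above, the sketch below doubles every value of a list and of a NumPy array. The variable names are made up for this example.

```python
import numpy as np

values_list = [1, 2, 3, 4]
values_array = np.array(values_list)

# With a list, an explicit loop (here a comprehension) is needed
doubled_list = [v * 2 for v in values_list]

# With an array, the operation is applied to every element at once
doubled_array = values_array * 2

print(doubled_list)         # [2, 4, 6, 8]
print(doubled_array)        # [2 4 6 8]
print(values_array.dtype)   # one integer dtype shared by all elements
```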

### Back to pandas

Pandas introduces 2 main data structures:
    
> pandas.Series

> pandas.DataFrame

![question](resources/series_dataframes.png)

**pandas.Series:**
    
A Pandas Series is a one-dimensional array holding data of any type.

It is like a column in a table.
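
A minimal sketch of a Series, assuming three made-up ages. It shows that a Series carries values, an index (the row labels), and an optional name.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
ages = pd.Series([50, 40, 45], name="age")

print(ages)
print(ages.index)     # RangeIndex(start=0, stop=3, step=1)
print(ages.iloc[0])   # 50, the value at the first position
```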

**pandas.DataFrame:**

A Pandas DataFrame is a two-dimensional data structure.

It is like a table with rows and columns.
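
To make the link between the two structures concrete, here is a small sketch that places two Series side by side to form a DataFrame. The data and variable names are illustrative only.

```python
import pandas as pd

names = pd.Series(["Jane", "Bill", "John"], name="name")
ages = pd.Series([50, 40, 45], name="age")

# Each Series becomes one column of the DataFrame
people = pd.concat([names, ages], axis=1)

print(people.shape)                            # (3, 2)
print(isinstance(people["age"], pd.Series))    # True: a column is a Series
```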



# Pandas DataFrame Hands-On

## How to create a DataFrame

### 1. From a Dict

In [1]:
# We need to import pandas once
import pandas as pd

In [2]:
data = {
  "name": ['Jane', 'Bill', 'John'],
  "age": [50, 40, 45]
}

# load data into a DataFrame
df = pd.DataFrame(data)

df

Unnamed: 0,name,age
0,Jane,50
1,Bill,40
2,John,45


### 2. From a List

In [3]:
data_list = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16],
             [17, 18, 19, 20]]

df = pd.DataFrame(data_list, columns=["col1", "col2", "col3", "col4"])
df

Unnamed: 0,col1,col2,col3,col4
0,1,2,3,4
1,5,6,7,8
2,9,10,11,12
3,13,14,15,16
4,17,18,19,20


### 3. From a CSV file

In [4]:
df = pd.read_csv('./resources/data.csv')

df

Unnamed: 0,name,age
0,Jane,32.0
1,Bob,43.0
2,Maggy,22.0
3,Andy,
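
The `age` column above is shown as floats because one value is missing. The sketch below reproduces this with an in-memory CSV; the text content is an assumption about what `data.csv` might look like, not the actual file.

```python
import io
import pandas as pd

# Hypothetical file content, mirroring the table above
csv_text = "name,age\nJane,32\nBob,43\nMaggy,22\nAndy,\n"

people = pd.read_csv(io.StringIO(csv_text))

print(people.dtypes)           # age is float64 because of the missing value
print(people["age"].isna())    # True only for the last row
```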


### 4. From a JSON file

In [5]:
df = pd.read_json('./resources/data.json')

df

Unnamed: 0,name,age
0,Tom,64
1,Paul,43
2,James,30
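
The layout of `data.json` is not shown here. One layout that `pd.read_json` accepts is a list of records, sketched below with an in-memory string (an assumption, the real file may be organised differently).

```python
import io
import pandas as pd

# Hypothetical JSON content: one object per row
json_text = (
    '[{"name": "Tom", "age": 64},'
    ' {"name": "Paul", "age": 43},'
    ' {"name": "James", "age": 30}]'
)

people = pd.read_json(io.StringIO(json_text))

print(people)
```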


### 5. From a SQL table

```python
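# connect_db is a placeholder: replace it with your database driver's
# connection call (for example sqlite3.connect or a SQLAlchemy engine).
db_connection = connect_db(url, user, password)

# "table" is a reserved word in SQL, so use the real table name here.
df = pd.read_sql_query("SELECT * FROM my_table", db_connection)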

```
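
For a version that can be run as-is, the sketch below uses an in-memory SQLite database from the standard library. The `users` table and its rows are invented for the example.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, used only for illustration
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (name TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Jane", 50), ("Bill", 40), ("John", 45)],
)

# The query result is loaded directly into a DataFrame
users = pd.read_sql_query("SELECT name, age FROM users ORDER BY age", connection)
connection.close()

print(users)
```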

## Get information about a DataFrame

In [None]:
# Same values as in data.json, so that the outputs below match
data = {
  "name": ['Tom', 'Paul', 'James'],
  "age": [64, 43, 30]
}

# load data into a DataFrame
df = pd.DataFrame(data)

### DataFrame columns

In [None]:
df.columns

### The data types of the DataFrame columns

All elements of a column have the same data type.


In [None]:
df.dtypes

### Get only one column of the DataFrame

The output is a Series of a given data type.


In [None]:
name_col = df.name
name_col
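
Attribute access such as `df.name` only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. Bracket notation is a more general alternative, sketched here with made-up data (the `home city` column is hypothetical).

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Jane", "Bill", "John"],
    "age": [50, 40, 45],
    "home city": ["Paris", "Lyon", "Nice"],   # made-up column with a space
})

# Bracket notation returns the same Series as attribute access
same = people["name"].equals(people.name)

# It also works when the column name is not a valid Python identifier
cities = people["home city"]

# A list of names selects several columns and returns a DataFrame
subset = people[["name", "age"]]

print(same)           # True
print(subset.shape)   # (3, 2)
```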

In [None]:
# The type of name_col is pd.Series
print(type(name_col))

# The dtype of the elements of name_col is object (typically used for strings)
print(name_col.dtype)

### Get meta information about the dataframe

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    3 non-null      object
 1   age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes


### Get the number of rows and columns

In [7]:
df.shape

(3, 2)

In [None]:
number_of_rows = df.shape[0]
number_of_cols = df.shape[1]

### Get statistics about the dataframe

In [8]:
df.describe()  # By default, computes the statistics for numeric columns only

Unnamed: 0,age
count,3.0
mean,45.666667
std,17.156146
min,30.0
25%,36.5
50%,43.0
75%,53.5
max,64.0
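
Since `describe()` skips non-numeric columns by default, a possible way to include them is the `include="all"` argument. The sketch below uses the same three rows as above; entries that do not apply to a column are filled with `NaN`.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Tom", "Paul", "James"], "age": [64, 43, 30]})

# include="all" adds the non-numeric columns to the summary
summary = people.describe(include="all")

print(summary.index.tolist())           # count, unique, top, freq, mean, ...
print(summary.loc["unique", "name"])    # 3 distinct names
```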


In [9]:
# We can also compute specific stats for a given column
print(df.age.mean())
print(df.age.count())
print(df.age.std())
print(df.age.min())
print(df.age.quantile(0.25)) 
print(df.age.quantile(0.5))
print(df.age.quantile(0.75))
print(df.age.max())

45.666666666666664
3
17.15614564327703
30
36.5
43.0
53.5
64


### Display first or last rows

In [None]:
df.head(2)   # Displays the first n rows, default is 5

In [None]:
df.tail(2)   # Displays the last n rows, default is 5

### Get a column's unique values

In [10]:
data_dup = {
  "name": ['Jane', 'Bill', 'John', 'Jane'],
  "age": [50, 40, 40, 30]
}

# load data into a DataFrame
df_with_duplicates = pd.DataFrame(data_dup)

df_with_duplicates

Unnamed: 0,name,age
0,Jane,50
1,Bill,40
2,John,40
3,Jane,30


In [11]:
df_with_duplicates.age.unique()

array([50, 40, 30])

In [12]:
df_with_duplicates.name.unique()

array(['Jane', 'Bill', 'John'], dtype=object)
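
Two related methods are often useful next to `unique()`: `nunique()` counts the distinct values and `value_counts()` counts how often each one appears. A short sketch on the same data:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Jane", "Bill", "John", "Jane"],
    "age": [50, 40, 40, 30],
})

print(people["name"].nunique())        # 3
print(people["name"].value_counts())   # Jane appears twice, the others once
```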

### Sort the dataframe by a column's value

In [13]:
df.sort_values(by="age")

Unnamed: 0,name,age
2,James,30
1,Paul,43
0,Tom,64


In [15]:
df.sort_values(by="name", ascending=False)

Unnamed: 0,name,age
0,Tom,64
1,Paul,43
2,James,30
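
Note that `sort_values` returns a new DataFrame and leaves the original untouched, and that the rows keep their original index labels. The sketch below illustrates this and uses `reset_index(drop=True)` to renumber the rows; the variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Tom", "Paul", "James"], "age": [64, 43, 30]})

sorted_people = people.sort_values(by="age")

print(people.index.tolist())          # [0, 1, 2], original order is unchanged
print(sorted_people.index.tolist())   # [2, 1, 0], labels follow the rows

# Renumber the rows from 0 after sorting
renumbered = sorted_people.reset_index(drop=True)
print(renumbered["name"].tolist())    # ['James', 'Paul', 'Tom']
```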


### Transpose the dataframe

In [16]:
df.T

Unnamed: 0,0,1,2
name,Tom,Paul,James
age,64,43,30


## Your turn to play

![your_turn](resources/your_turn.gif)

### Exercise:

- Create a dataframe from the file stored in **./resources/users.csv**
- Display the first 5 rows
- Show the data type of each column
- Show the size (number of rows)
- Give some statistics about the numeric columns
- Explore...

In [19]:
import pandas as pd
df = pd.read_csv('./resources/users.csv', parse_dates=["birthday"])
df.dtypes

user_uuid               int64
first_name             object
birthday       datetime64[ns]
city                   object
country                object
is_new_user              bool
dtype: object

In [23]:
df[~df.is_new_user]

Unnamed: 0,user_uuid,first_name,birthday,city,country,is_new_user
1,31111,Margot Deschamps,1983-09-25,Clément-sur-Mer,Géorgie,False
4,25123,Pierre Paul,1938-12-13,Sainte AdrienneBourg,Israël,False
6,33036,Gilbert Delahaye,1967-04-20,Chauvet,Zimbabwe,False
7,32164,Corinne Hoarau,1984-04-03,Clément-sur-Guibert,Bahamas,False
12,27047,Julien Le De Oliveira,2001-11-12,Bertin,Libye,False
...,...,...,...,...,...,...
986,30516,Isaac Rousset-Thomas,1995-03-27,Thierry-sur-Lemonnier,Indonésie,False
988,31121,Sylvie Gonzalez,1971-05-16,Maréchaldan,Azerbaïdjan,False
993,33289,Caroline Wagner,1993-02-07,Gilles,Nouvelle-Zélande,False
996,33851,Laurent Normand,1980-02-25,Hoarau,Kazakhstan,False
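
The expression `df[~df.is_new_user]` filters rows with a boolean mask: `df.is_new_user` is a Series of `True`/`False` values, `~` inverts it, and indexing with the result keeps only the rows where the mask is `True`. A small sketch with invented users:

```python
import pandas as pd

users = pd.DataFrame({
    "first_name": ["Ana", "Ben", "Cleo", "Dan"],
    "country": ["France", "Chile", "France", "Kenya"],
    "is_new_user": [True, False, False, True],
})

# Keep only the rows where is_new_user is False
returning = users[~users.is_new_user]

# Conditions are combined with & (and) or | (or); each one needs parentheses
french_returning = users[(users.country == "France") & ~users.is_new_user]

print(returning["first_name"].tolist())          # ['Ben', 'Cleo']
print(french_returning["first_name"].tolist())   # ['Cleo']
```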
