# Python for Data Science
## Session 6
### Datasets – Pandas I 

---

## Outline

1. Pandas

2. Loading and exploring datasets 

3. Data cleaning and preprocessing with Pandas 

---

## Pandas I

**Pandas** is one of the most used libraries within the Data Science community. It provides a full set of tools to work with:
- 2D data via DataFrame class
    - SQL/Spreadsheet-like datasets (tabular data)
    - Arbitrary matrix data with row and column labels
    - Any type of dataset with observational / statistical data (no labels needed)
- 1D data via Series class
    - Time series data
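
As a minimal sketch of the two classes listed above (the variable names and values are made up for illustration), a Series holds one labelled column of values and a DataFrame holds several of them side by side:

```python
import pandas as pd

# A Series is a 1D labelled array; here the labels are dates (a tiny time series).
temperatures = pd.Series(
    [21.5, 22.1, 19.8],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    name="temperature",
)

# A DataFrame is a 2D table; selecting one column gives back a Series.
people = pd.DataFrame({"name": ["Amelie", "Edgar"], "age": [35, 32]})

print(type(temperatures).__name__)   # Series
print(type(people["age"]).__name__)  # Series
print(people.shape)                  # (2, 2)
```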


## Pandas I

As the types of datasets **Pandas** can handle suggest, most of the use cases found in Data Science are covered, e.g. finance, health, biology, supply chain, or meteorology. Things **Pandas** does:

1. Handling missing data
2. Changing DataFrame size, adding and removing columns and rows at will
3. Automatic data alignment of misaligned data
4. Group by operation
5. Data conversion
6. Advanced indexing
7. Data Merging and joining 
8. Reshaping data
9. Hierarchical indexing
10. Read/write support for CSV, Excel, databases, and fast HDF5 format.
11. Time series manipulation, frequency conversion, moving window statistics
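
To make item 3 (automatic data alignment) concrete, here is a small sketch. The Series `q1` and `q2` and their labels are invented for illustration; the point is that arithmetic matches values by label, not by position:

```python
import pandas as pd

q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Values are paired by label; labels present on only one side give NaN.
total = q1 + q2

# add() with fill_value treats a missing partner as 0 instead.
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```

Labels `a` and `d` appear in only one of the two Series, so they are NaN in `total` but keep their single value in `total_filled`.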


## Pandas I

Let's create a few DataFrames using different data structures and see how to start navigating them.


In [None]:
## Let's create a simple DataFrame to work with
import pandas as pd
dataset = [
    {
        "name": "Amelie",
        "age": 35
    },
    {
        "name": "Edgar",
        "age": 32
    }
]
df = pd.DataFrame(dataset)
df

In [None]:
# Using a slightly different data structure
data = {'Name': ['Amelie', 'Edgar', 'Carlos', 'Victor'],
        'Age': [24, 27, 22, 32],
        'Country': ['FR', 'FR', 'ES', 'GE']}
df = pd.DataFrame(data)
df

In [None]:
# we can quickly see all the columns a dataframe contains by simply accessing its columns attribute
df.columns

In [None]:
# Another important thing we can do is to set up any specific index
# these act as row labels
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
df

In [None]:
# Similar to lists and arrays, we can slice
df[:2]

In [None]:
# We can also access elements by label with loc; unlike list slicing, the end label is included
df.loc['B':'D']

In [None]:
# iloc accesses elements by integer position, with the usual slicing rules (end excluded)
df.iloc[1:]

In [None]:
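# We can also access a single value by using at with the row label and column name
# (the index is now 'A'..'D', so the label 0 no longer exists and would raise a KeyError)
df.at['A', 'Age']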

In [None]:
# if we wanted to do it with iloc, we would need to pass positions
df.iloc[0,1]
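
The difference between label-based and position-based access is easy to mix up, so here is a small self-contained sketch. The frame `df_demo` is a reduced copy of the data used above, and the variable names are only illustrative:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Amelie", "Edgar", "Carlos", "Victor"], "Age": [24, 27, 22, 32]},
    index=["A", "B", "C", "D"],
)

by_label = df_demo.loc["B":"C"]   # label slice: both B and C are included
by_position = df_demo.iloc[1:2]   # position slice: stop is excluded, so only row B

age_by_label = df_demo.at["A", "Age"]   # single value by row label and column name
age_by_position = df_demo.iat[0, 1]     # single value by row and column position

print(list(by_label.index))      # ['B', 'C']
print(list(by_position.index))   # ['B']
```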

In [None]:
# Important: to modify a specific value we can use at, loc and iloc
df.loc['B', 'Age'] = 123
df

In [None]:
# we can also check what elements satisfy some criteria by column (attribute)
df['Age'] > 25

In [None]:
# and show them
df[df['Age'] > 25]

In [None]:
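# or modify a single attribute of the rows that satisfy a certain condition
# (iloc does not accept a boolean Series as a mask; convert it to an array first:
#  df.iloc[(df['Age'] > 25).to_numpy(), 0] = 'Unknown')
df.loc[df['Age'] > 25, 'Name'] = 'Unknown'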
df
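
Conditions like the one above can also be combined. The sketch below uses a fresh copy of the example data under the illustrative name `people`. Each condition needs its own parentheses, and `&` / `|` replace Python's `and` / `or`:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Amelie", "Edgar", "Carlos", "Victor"],
    "Age": [24, 27, 22, 32],
    "Country": ["FR", "FR", "ES", "GE"],
})

# Both conditions must hold: older than 23 AND from France.
french_over_23 = people[(people["Age"] > 23) & (people["Country"] == "FR")]

# isin checks membership in a list of allowed values.
es_or_ge = people[people["Country"].isin(["ES", "GE"])]

print(list(french_over_23["Name"]))  # ['Amelie', 'Edgar']
print(list(es_or_ge["Name"]))        # ['Carlos', 'Victor']
```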

In [None]:
# we can also, similar to the way we do with dicts, add new columns
df['Residency'] = [True, False, True, False]

In [None]:
# or add new rows by assigning to an index label that does not exist yet
df.loc[6] = ['Jordi', 23, 'ES', False]
df.loc[9] = ['Anna', 19, 'ES', False]
df

In [None]:
# head returns the first n rows (5 if n is not given)
df.head(2)

In [None]:
# tail returns the last n rows (5 if n is not given)
df.tail(2)

In [None]:
df.shape # to get the dataframe shape

In [None]:
df.dtypes

In [None]:
df['Age'].astype('float') # returns a converted copy of the column; df itself is unchanged

In [None]:
# As in NumPy, the unique method returns the distinct values of a column
df['Country'].unique()

## Pandas I

Among the different methods two important ones are **info** and **describe**:

In [None]:
df.info()

In [None]:
df.describe()
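
By default `describe` only summarises numeric columns. As a hedged sketch on a small copy of the example data (named `people` here for illustration), passing `include="all"` also brings in the non-numeric columns:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Amelie", "Edgar", "Carlos", "Victor"],
    "Age": [24, 27, 22, 32],
    "Country": ["FR", "FR", "ES", "GE"],
})

numeric_summary = people.describe()            # only the numeric column Age
full_summary = people.describe(include="all")  # adds count/unique/top/freq for text columns

print(list(numeric_summary.columns))           # ['Age']
print(numeric_summary.loc["mean", "Age"])      # 26.25
print(list(full_summary.columns))              # ['Name', 'Age', 'Country']
```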

## Pandas I

And how do we know about missing elements and handle them?

In [None]:
# What about missing values? Missing values are usually represented as NaN (Not a Number)
import numpy as np
data = {'Name': ['Amelie', 'Edgar', 'Carlos', 'Victor'],
        'Age': [24, 27, 22, 32],
        'Country': ['FR', np.nan, 'ES', 'GE']}
df = pd.DataFrame(data)
df

In [None]:
# we can find out which entries are missing by using isna or isnull
df.isna() # at the whole dataframe level

In [None]:
# and at the column level
df['Country'].isnull() # same as .isna()

In [None]:
# A simple way of filling them is calling fillna (this fills the NaNs of every column)
df.fillna('RE')

In [None]:
# Another possibility is to fill with the mode (most common value) of the column
#df['Country'].mode()
df.fillna(df['Country'].mode()[0])

In [None]:
# IMPORTANT: if we check the dataframe again, we will see that it still contains NaNs
# This is because fillna, like most Pandas methods, returns a new DataFrame and leaves the original untouched
# If you want to keep the result, reassign it (df = df.fillna(...)) or pass inplace=True to fillna
df
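
A minimal sketch of that behaviour, using an illustrative frame `df_demo` and the placeholder value `"Unknown"`. Passing a dict to `fillna` restricts the fill to the named columns:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Amelie", "Edgar", "Carlos", "Victor"],
    "Age": [24, 27, 22, 32],
    "Country": ["FR", np.nan, "ES", "GE"],
})

# fillna returns a new DataFrame; the dict limits the fill to the Country column.
filled = df_demo.fillna({"Country": "Unknown"})

print(df_demo["Country"].isna().sum())  # 1 -> the original still has its NaN
print(filled["Country"].isna().sum())   # 0 -> the returned copy is filled
```

Reassigning (`df_demo = df_demo.fillna(...)`) is usually the clearest way to keep the result.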

In [None]:
# A more drastic operation is to drop any row (or column) with missing values using dropna
# here, we first add another NaN to the original dataframe and then drop the rows with NaNs
df.at[0, 'Age'] = np.nan
df

In [None]:
df.dropna()
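
`dropna` removes rows by default, but it can also drop columns or look at only some columns. The sketch below shows the three variants on an illustrative frame `df_demo`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Amelie", "Edgar", "Carlos", "Victor"],
    "Age": [np.nan, 27, 22, 32],
    "Country": ["FR", np.nan, "ES", "GE"],
})

rows_dropped = df_demo.dropna()                # drop every row containing a NaN
cols_dropped = df_demo.dropna(axis="columns")  # drop every column containing a NaN
age_required = df_demo.dropna(subset=["Age"])  # only a NaN in Age removes the row

print(list(rows_dropped["Name"]))   # ['Carlos', 'Victor']
print(list(cols_dropped.columns))   # ['Name']
print(list(age_required["Name"]))   # ['Edgar', 'Carlos', 'Victor']
```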

## Pandas I

One of the handy tools you have in **Pandas** is **groupby**. It lets you group rows by the unique values of one or more columns and compute statistics for each group:
- count
- mean
- sum
- min and max
- multiple aggregations
- group using multiple columns

In [None]:
data = {
    'Name': ['Amelie', 'Edgar', 'Carlos', 'Victor', 'Sofia', 'Jin', 
             'Marta', 'Ali', 'Emily', 'Ravi', 'Chen', 'Fatima', 'Saham'],
    'Age': [24, 27, 22, 32, 29, 31, 28, 26, 23, 34, 25, 30, 26],
    'Country': ['FR', 'US', 'ES', 'GE', 'PT', 'KR',
                'ES', 'AE', 'US', 'IN', 'CN', 'AE', 'AE'],
    'Salary': [70000, 110000, 65000, 82000, 48000, 39000, 45000,
            90000, 97000, 31000, 49000, 85000, 80000]   
}

df = pd.DataFrame(data)
df

In [None]:
# let's count how many we have per country
df.groupby('Country').count()

In [None]:
df.groupby('Country')['Age'].mean() # now let's know the average age per country

In [None]:
df.groupby('Country')['Salary'].sum() # total sum of their salaries

In [None]:
# Multiple Aggregations
df.groupby('Country').agg({'Salary': ['min', 'max'], 'Age': 'mean'})

In [None]:
# grouping using more than one column
df.groupby(['Country', 'Age'])['Salary'].count()
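
The dict form of `agg` used above produces two-level column names. As an alternative sketch, named aggregation gives flat, readable names. The frame `staff` is a subset of the data above, and the output names `people`, `mean_age` and `max_salary` are chosen here for illustration:

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Edgar", "Carlos", "Marta", "Ali", "Emily", "Fatima", "Saham"],
    "Age": [27, 22, 28, 26, 23, 30, 26],
    "Country": ["US", "ES", "ES", "AE", "US", "AE", "AE"],
    "Salary": [110000, 65000, 45000, 90000, 97000, 85000, 80000],
})

# Each keyword is an output column: (input column, aggregation function).
summary = staff.groupby("Country").agg(
    people=("Name", "count"),
    mean_age=("Age", "mean"),
    max_salary=("Salary", "max"),
)

print(summary)
```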

## Pandas I

Let's load *Netflix* titles and do some exercises

In [None]:
# Download the zip file containing the Netflix dataset from Moodle
path = 'session_6/netflix_titles.csv'

df = pd.read_csv(path)

In [None]:
df.columns

## Pandas I
Let's do the following exercises:

1. Count Missing Values in Each Column

2. Fill Missing 'country' Values with "Unknown"

3. Filter for TV Shows Only

4. Count the Number of Entries per Rating

5. Add a Column Showing Content Age (how many years since it came out)
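
As a hint rather than a solution, the sketch below shows two building blocks on a made-up toy frame `titles`. The column name `rating` and all the values are assumptions for illustration; check `df.columns` for the real names in the Netflix file:

```python
import pandas as pd

# Toy data only; not taken from the Netflix dataset.
titles = pd.DataFrame({
    "country": ["ES", None, "US", "US"],
    "rating": ["PG", "TV-MA", "PG", None],
})

# isna() gives True/False per cell; summing counts the True values per column.
missing_per_column = titles.isna().sum()

# value_counts() counts how often each distinct value appears (NaN is skipped).
entries_per_rating = titles["rating"].value_counts()

print(missing_per_column)
print(entries_per_rating)
```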


## Pandas I

Let's now load the *Titanic* dataset and practice a little bit more:

1. Count the Missing Values in Each Column

2. Fill Missing 'Age' Values with the Mean Age

3. Fill Missing 'Embarked' Values with the Mode (Most Common Value)

4. Filter and Display Passengers Who Paid a Fare Above the Average Fare

5. Add a New Column Indicating Family Size. Create a new column 'FamilySize' as the sum of 'SibSp' (siblings/spouses) and 'Parch' (parents/children)

In [None]:
path = 'session_6/train_and_test2.csv'

df = pd.read_csv(path)
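
Again as a hint on toy data, not the real file: the sketch below fills a numeric column with its mean and builds a column from two others. The column names come from the exercise text, and the actual CSV may spell them differently, so check `df.columns` first:

```python
import numpy as np
import pandas as pd

# Toy data only; not taken from the Titanic dataset.
passengers = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "SibSp": [1, 0, 2],
    "Parch": [0, 0, 1],
})

# mean() skips NaN, so the mean of 22 and 30 (26.0) replaces the missing value.
passengers["Age"] = passengers["Age"].fillna(passengers["Age"].mean())

# Column arithmetic works row by row.
passengers["FamilySize"] = passengers["SibSp"] + passengers["Parch"]

print(passengers)
```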

## Pandas I

Home exercises for Netflix:

1. Is there any missing rating?
2. How many films in 2021 correspond to your country?
3. What's the number of movies in 2020 with full information?
4. Which year has the most titles?
5. What has been the average number of releases per year since 2010?

And for Titanic:

1. Calculate Gender-Based Survival Percentage

2. Calculate Survival Percentage Grouped by Gender and Class
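
One useful trick for percentages: the mean of a 0/1 column is the fraction of ones. The sketch below applies it to toy data. The column names `Sex`, `Pclass` and `Survived` are assumptions and may not match the file used in class:

```python
import pandas as pd

# Toy data only; not taken from the Titanic dataset.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 3, 1, 3],
    "Survived": [1, 1, 0, 1],
})

# Mean of a 0/1 column per group, times 100, gives a percentage.
by_gender = toy.groupby("Sex")["Survived"].mean() * 100

# Passing a list of columns groups by every combination of their values.
by_gender_and_class = toy.groupby(["Sex", "Pclass"])["Survived"].mean() * 100

print(by_gender)
print(by_gender_and_class)
```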