In [None]:
%pip install pandas scikit-learn
%pip freeze > ../requirements.txt

<header style="background: white; border-top: 8px solid #602663; padding: 1em;">
<div>
<span style="color: black; font-size: medium; font-weight: 700; text-transform: uppercase;">Level 6 Data Science / Software Engineering</span><br><span style="color: #602663; font-size: xxx-large; font-weight: 900;">Topic 3 &mdash; Introduction to Pandas</span>
</div>
</header>

## The Problem: Handling Complex Data

In data science, we often encounter large, complex datasets that can be challenging to work with. These datasets may come from various sources and in different formats, such as:

- CSV files
- Excel spreadsheets
- SQL databases
- JSON files
- Web APIs

Working with this data efficiently and effectively presents several challenges:

1. **Data cleaning**: Real-world data is often messy, containing missing values, duplicates, or inconsistencies.
2. **Data manipulation**: Reshaping, merging, and transforming data to suit our analysis needs.
3. **Data analysis**: Performing calculations, aggregations, and statistical operations on large datasets.
4. **Performance**: Processing large amounts of data quickly and efficiently.

## Pandas: Your Data Analysis Swiss Army Knife

[Pandas](https://pandas.pydata.org) is a powerful, open-source library for Python that addresses these challenges and more. It provides high-performance, easy-to-use data structures and tools for data manipulation and analysis.

### Key Features of Pandas:

1. **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types.
2. **Series**: A one-dimensional labeled array that can hold data of any type.
3. **Efficient data manipulation**: Tools for reading, writing, and transforming data.
4. **Handling of missing data**: Built-in support for handling missing data.
5. **Merging and joining datasets**: Combining multiple datasets easily.
6. **Time series functionality**: Tools for working with date and time data.
7. **Integration with other libraries**: Works well with [NumPy](https://numpy.org), [Matplotlib](https://matplotlib.org), and [scikit-learn](https://scikit-learn.org).

### The Role of Pandas in Data Analysis

Pandas serves as a crucial tool in the data analysis pipeline:

1. **Data Loading**: Pandas can read data from various file formats and databases.
2. **Data Cleaning**: It provides functions to handle missing values, remove duplicates, and format data.
3. **Data Transformation**: Pandas allows you to reshape, merge, and pivot your data.
4. **Data Analysis**: You can perform complex operations, grouping, and aggregations on your data.
5. **Data Visualisation**: While not a visualisation library itself, Pandas integrates well with plotting libraries like Matplotlib.

Let's explore these features hands-on and see how Pandas can make your data analysis tasks more efficient and enjoyable.

## Getting Started

Let's begin by importing Pandas and creating a simple `DataFrame` to explore its basic functionality. We'll use a dictionary to create a `DataFrame` with some sample data.

In [None]:
# Import pandas - by convention it is aliased as `pd`
import pandas as pd

# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Display the DataFrame
df


Let's explore some basic operations that Pandas can perform on this `DataFrame`.

In [None]:
# Access a single column
ages = df['Age']

print(type(ages))

ages

In [None]:
# Access multiple columns
columns = df[['Name', 'City']]

print(type(columns))

columns

In [None]:
# Access a specific row by integer position (0-based)
row = df.iloc[1]

print(type(row))

row

In [None]:
# Access a specific cell by row label and column name
cell = df.loc[2, 'Age']

print(type(cell))

cell
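The difference between `.iloc` (position-based) and `.loc` (label-based) is easier to see when the row labels are not the default `0, 1, 2, ...`. The following is a minimal sketch; the `people` DataFrame and its string labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default integers
people = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=['alice', 'bob', 'charlie'],
)

# `.iloc` selects by integer position: row 1, column 0
age_by_position = people.iloc[1, 0]

# `.loc` selects by row label and column name
age_by_label = people.loc['bob', 'Age']

print(age_by_position, age_by_label)
```

Both lookups return the same cell here. With the default integer index used earlier, labels and positions happen to coincide, which is why the two accessors can look interchangeable.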

In [None]:
# Basic statistics
df['Age'].describe()

In [None]:
# Filtering data
df[df['Age'] >= 30]
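Filters can also be combined. The sketch below rebuilds the sample data under the assumed name `people` so that it runs on its own, and shows two common patterns: combining conditions with `&`, and testing membership with `isin()`.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago'],
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
over_26_not_chicago = people[(people['Age'] > 26) & (people['City'] != 'Chicago')]

# `isin()` keeps rows whose value appears in the given list
selected_cities = people[people['City'].isin(['New York', 'Los Angeles'])]

print(over_26_not_chicago)
print(selected_cities)
```

Python's `and` and `or` keywords do not work element-wise on a `Series`, which is why the bitwise operators are used instead.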

In [None]:
# Adding a new column
df['Country'] = 'USA'

df

In [None]:
# Simple data manipulation
df['Age in 5 years'] = df['Age'] + 5

df


These examples demonstrate some fundamental Pandas operations:

1. Creating a `DataFrame` from a dictionary
2. Accessing columns and rows
3. Retrieving basic statistics
4. Filtering data based on conditions
5. Adding new columns
6. Performing simple calculations on columns

As you can see, Pandas makes it easy to work with structured data. In the next sections, we'll explore more advanced features like data loading from files, data cleaning, and more complex manipulations.

## Data Loading

One of the most common tasks in data analysis is loading data from various sources. Pandas provides functions to read data from CSV files, Excel spreadsheets, SQL databases, JSON files, and more. For a complete list of supported file formats, refer to the Pandas [Input/output documentation](https://pandas.pydata.org/docs/reference/io.html).

Let's explore how to load data from a CSV file using Pandas. We'll use the `pd.read_csv()` function to read a CSV file into a `DataFrame`.

In [None]:
import os
import pandas as pd

# Titanic dataset
# path = os.path.join(os.getcwd(), '..','datasets', 'titanic.csv')
url = "https://raw.githubusercontent.com/bpp-sot/l6ds-se-sep23-grp1/refs/heads/main/datasets/titanic.csv"

titanic = pd.read_csv(url)

titanic
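`pd.read_csv()` accepts many optional parameters. The sketch below uses two of them, `usecols` and `index_col`, on a small in-memory CSV so that it runs without a network connection. The CSV content is made up for illustration and is not taken from the Titanic dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """id,name,fare
1,Alice,7.25
2,Bob,
3,Charlie,71.28
"""

# `usecols` limits which columns are read; `index_col` uses a column as the row index
sample = pd.read_csv(io.StringIO(csv_text), usecols=['id', 'fare'], index_col='id')

print(sample)
```

The empty `fare` field for the second record is read as a missing value (`NaN`).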

Once loaded, we can explore the data, check its structure, and perform various operations on it. This is the first step in the data analysis process: getting the data into a format that we can work with.

In [None]:
# The shape of the DataFrame - a tuple: (rows, columns)
titanic.shape

In [None]:
# Display a summary of the DataFrame
titanic.info()

In [None]:
# Display the first few rows of the DataFrame
titanic.head()

In [None]:
# Display the last few rows of the DataFrame
titanic.tail()

In [None]:
# Display a random sample of the DataFrame
titanic.sample(5)

We can use `groupby()` to group the data by a specific column and perform aggregations on it. This is a powerful feature that allows us to summarise and analyse data efficiently.

In [None]:
# Group by `pclass` and calculate the mean of the `fare` column
titanic.groupby(['pclass'])['fare'].mean()

In [None]:
# Group by `sex` and then `survived`, then count
survivals = titanic.groupby(['sex', 'survived']).size()

survivals
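Grouped data can be summarised in several ways at once. The sketch below uses a small made-up `passengers` DataFrame, with column names assumed to match the Titanic data, to show `agg()` with multiple aggregations, a per-group rate, and `unstack()` to turn grouped counts into a table.

```python
import pandas as pd

# Hypothetical records; only the column names mirror the Titanic dataset
passengers = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'survived': [1, 0, 1, 0, 1, 0],
    'fare': [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Several aggregations of one column per group
fare_summary = passengers.groupby('pclass')['fare'].agg(['mean', 'median', 'count'])

# The mean of a 0/1 column is the proportion of 1s, i.e. the survival rate
survival_rate = passengers.groupby('sex')['survived'].mean()

# `unstack()` pivots the inner group level (`survived`) into columns
counts = passengers.groupby(['sex', 'survived']).size().unstack(fill_value=0)

print(fare_summary)
print(survival_rate)
print(counts)
```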

## Data Cleaning

Real-world data is often messy and requires cleaning before analysis. This involves handling missing values, removing duplicates, and fixing inconsistencies in the data. Pandas provides functions to help with these tasks.

### Handling Missing Data

Missing data is a common issue in datasets and can affect the quality of our analysis. Pandas provides several functions to handle missing data, such as `isnull()`, `notnull()`, `dropna()`, and `fillna()`. These functions allow us to identify missing values, remove rows or columns with missing data, or fill in missing values with a specified value.

Let's explore how to handle missing data in a `DataFrame` using Pandas. 

In [None]:
# Find the number of missing values in each column - `isna()` returns a DataFrame of booleans and `sum()` sums the columns
titanic.isna().sum()

> Python will coerce a boolean value into an integer value when needed, so `True` becomes `1` and `False` becomes `0`. This is why we can use the `sum()` function to count the number of missing values (`isna() => True`) in each column.
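The same boolean trick works with `mean()`, which gives the fraction of missing values rather than the count. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical values with some gaps
frame = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'fare': [7.25, 8.05, np.nan, 13.0],
})

missing_counts = frame.isna().sum()   # number of missing values per column
missing_share = frame.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```

The fraction is often more useful than the raw count when deciding whether a column is worth keeping.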

### Handling Missing Values - Cabin

There are 1,309 records in this dataset, and there are a significant number of missing values in the `cabin` column.

We can use the `dropna()` function to remove rows with missing values in the `cabin` column, but this would result in losing a large portion of the dataset. Another approach might be to drop the `cabin` column entirely, but we should consider the impact of this decision on our analysis. What if the `cabin` column contains valuable information that we need for our analysis? 

Another option is to fill in the missing values with a placeholder value, such as `'None'`. This approach allows us to retain the information in the `cabin` column while handling the missing data. 


In [70]:
# Replace missing values in the `cabin` column with the string 'None'
titanic['cabin'] = titanic['cabin'].fillna('None')
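To compare the three options discussed above side by side, here is a small sketch on a made-up DataFrame named `toy`. It is only meant to show what each call keeps and discards.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of four records have no cabin
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'cabin': ['C85', np.nan, np.nan, 'E46'],
})

# Option 1: drop rows with a missing cabin (loses half the records here)
rows_dropped = toy.dropna(subset=['cabin'])

# Option 2: drop the column entirely (keeps all rows, loses the cabin information)
column_dropped = toy.drop(columns=['cabin'])

# Option 3: fill with a placeholder (keeps both rows and column)
filled = toy.assign(cabin=toy['cabin'].fillna('None'))

print(len(rows_dropped), list(column_dropped.columns), filled['cabin'].tolist())
```

None of these calls modifies `toy` itself; each returns a new DataFrame.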

### Handling Missing Values - Age

The `age` column also has some missing values, but we can fill them in with the mean age of the passengers. This way, we retain the data while handling the missing values.

In [None]:
# Fill missing values in `age` column with the mean
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

titanic.sample(5)

### Handling Missing Values - Embarked

The `embarked` column has only two missing values, which we can drop without losing much information, but it's also possible to fill them in with the most common value.

In [None]:
# Fill missing values in `embarked` column with the mode
titanic['embarked'] = titanic['embarked'].fillna(titanic['embarked'].mode()[0])

titanic.sample(5)
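The `[0]` after `mode()` is needed because `mode()` returns a `Series` rather than a single value, since several values can tie for most frequent. A minimal sketch, with made-up port codes:

```python
import pandas as pd

# Hypothetical port codes with one missing entry
ports = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])

modes = ports.mode()          # a Series; missing values are ignored by default
most_common = modes.iloc[0]   # take the first mode

ports_filled = ports.fillna(most_common)

print(most_common)
print(ports_filled.tolist())
```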

### Handling Missing Values - Fare

The `fare` column has only one missing value, but we shouldn't simply fill it with either the overall mean or median, as the fare is likely to be related to other factors such as the passenger class or the port of embarkation.

Let's have a look at the row in question:

In [None]:
# Find rows where `fare` is missing
index = titanic[titanic['fare'].isna()].index

# `index` holds row labels, so use the label-based `.loc` accessor
titanic.loc[index]

We can impute the missing fare with the median fare of the passengers who travelled in the same class and embarked from the same port. We use the median rather than the mean because fares are likely to be skewed, and the median is less sensitive to outliers.

To get the median fare for each group, we can use the `groupby()` function to group the data by passenger class and port of embarkation, and then calculate the median fare for each group. 

In [None]:
# Group by `pclass` and `embarked` and expose the `fare` column
fare_by_pclass_and_embarked = titanic.groupby(['pclass', 'embarked'])['fare']
fare_by_pclass_and_embarked.median()

To get the fare, we need to index the resulting `fare_by_pclass_and_embarked` object by the passenger class and port of embarkation. The resulting code is a little difficult to read, but here it is:

In [None]:
# Get median fare for 'pclass' = 3 and 'embarked' = 'S'
pclass = 3
embarked = 'S'

fare_by_pclass_and_embarked.median()[pclass][embarked]

We could do this manually, but we'll create a function to help automate the process.

In [None]:
def impute_fare(row):
    """
    Impute missing values in the `fare` column by taking the median of the fares of passengers with the same `pclass` and `embarked` values.

    Parameters:
    row: A row in the DataFrame

    Returns:
    The fare value if it is not missing, otherwise the median fare of passengers with the same `pclass` and `embarked` values
    """
    if pd.isna(row['fare']):
        pclass = row['pclass']
        embarked = row['embarked']
        return titanic.groupby(['pclass', 'embarked'])['fare'].median()[pclass][embarked]
    else:
        return row['fare']

# Apply the `impute_fare` function to the DataFrame
titanic['fare'] = titanic.apply(impute_fare, axis=1)

titanic.loc[index]  # Check that the record with the missing fare has been updated
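The row-by-row function above recomputes the grouped medians for every row it is applied to. One possible alternative is `groupby(...).transform('median')`, which computes each group's median once and aligns it back to the original rows. The sketch below uses a made-up `toy` DataFrame; it is an illustration of the technique, not the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing fare in the (3, 'S') group
toy = pd.DataFrame({
    'pclass':   [3, 3, 3, 1, 1],
    'embarked': ['S', 'S', 'S', 'C', 'C'],
    'fare':     [8.0, 10.0, np.nan, 80.0, 100.0],
})

# For every row, the median fare of its (pclass, embarked) group
group_median = toy.groupby(['pclass', 'embarked'])['fare'].transform('median')

# Keep existing fares; use the group median only where the fare is missing
toy['fare'] = toy['fare'].fillna(group_median)

print(toy)
```

Because `transform()` returns a `Series` with the same index as the original data, it can be passed straight to `fillna()`.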


In [82]:
# Check that there are no missing values
assert titanic.isnull().sum().sum() == 0

Pandas provides other ways of handling missing data, such as [interpolation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html#pandas-dataframe-interpolate), [forward-fill](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html#pandas-dataframe-ffill), and [backward-fill](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.bfill.html#pandas-dataframe-bfill), depending on the context of the data. For more information, refer to the Pandas [missing data documentation](https://pandas.pydata.org/docs/reference/frame.html#missing-data-handling).
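As a brief illustration of those three methods, the sketch below applies each to the same made-up `Series`. These techniques are mostly useful for ordered data such as time series, so they were not used on the Titanic records above.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a two-value gap
readings = pd.Series([1.0, np.nan, np.nan, 4.0])

interpolated = readings.interpolate()   # linear interpolation by default
forward = readings.ffill()              # carry the last known value forward
backward = readings.bfill()             # pull the next known value backward

print(interpolated.tolist())
print(forward.tolist())
print(backward.tolist())
```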

## Adding Features

In data analysis, we often need to create new features from existing data to extract more insights. Pandas makes it easy to add new columns to a `DataFrame` based on existing columns. We can perform calculations, transformations, or aggregations to create new features.

### Adding Features - Title

The `name` column in the Titanic dataset contains both the passenger's name and title. We can extract the title from the name and create a new column called `title` to store this information. This transformation can help us analyse the data based on the passenger's title.

> We're using a regular expression to extract the title from the name. The pattern `([A-Za-z]+)\.` captures one or more letters (`[A-Za-z]+`) that are immediately followed by a period (`\.`). Only the letters inside the parentheses are kept, so the extracted titles are 'Mr', 'Mrs', 'Miss', etc., without the trailing period.

In [None]:
# Extract the title from the `name` column
titanic['title'] = titanic['name'].str.extract('([A-Za-z]+)\\.', expand=False)
titanic['title'].value_counts()
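To see the pattern in isolation, here is a small sketch on made-up names. The `Surname, Title. Given names` format is an assumption about how the `name` column is laid out.

```python
import pandas as pd

# Hypothetical names in the assumed "Surname, Title. Given names" format
names = pd.Series([
    'Smith, Mr. John',
    'Jones, Mrs. Mary Ann',
    'Brown, Miss. Emily',
])

# A raw string (r'...') avoids having to double the backslash
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)

print(titles.tolist())
```

`str.extract()` returns the first match in each string, and `expand=False` returns a `Series` rather than a one-column DataFrame.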

In [None]:
# Replace rare titles with 'Rare'
rare_titles = ['Dr', 'Rev', 'Col', 'Major', 'Lady', 'Capt', 'Sir', 'Jonkheer', 'Dona', 'Don', 'Countess']
titanic['title'] = titanic['title'].replace(rare_titles, 'Rare')

# Replace 'Mlle' and 'Ms' with 'Miss', and 'Mme' with 'Mrs'
titanic['title'] = titanic['title'].replace(['Mlle', 'Ms'], 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Mrs')

titanic['title'].value_counts()

### Adding Features - One Hot Encoding

Categorical variables like `sex` and `embarked` need to be converted into numerical form for machine learning algorithms. One common technique is one-hot encoding, which creates binary columns for each category in the original column. Pandas provides a function called `get_dummies()` to perform one-hot encoding on categorical variables.

In [None]:
# One-hot encoding of `embarked` and `title` columns
categorical_columns = ['embarked', 'title']

titanic = pd.get_dummies(titanic, columns=categorical_columns, drop_first=True)

titanic.head()
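The effect of `drop_first=True` is easier to see on a tiny example. The sketch below encodes a made-up `embarked` column with and without it.

```python
import pandas as pd

# Hypothetical port codes
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})

full = pd.get_dummies(toy, columns=['embarked'])
reduced = pd.get_dummies(toy, columns=['embarked'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, the first category (`C`) gets no column of its own and is instead represented by all remaining columns being zero. This avoids a redundant column, since any one indicator can be inferred from the others.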

### Adding Features - Family Size and Alone

We can create a new feature called `family_size` by combining the `sibsp` (number of siblings/spouses aboard) and `parch` (number of parents/children aboard) columns. This new feature represents the total number of family members aboard with each passenger.

In [None]:
# Add a new column `family_size` by summing the `sibsp` and `parch` columns
titanic['family_size'] = titanic['sibsp'] + titanic['parch']

titanic.head()

We can also create a new feature called `alone` to indicate whether a passenger was travelling alone or with family. This binary feature can help us analyse the survival rates of passengers travelling alone versus those travelling with family.

In [None]:
# Add a new column `alone` that is `True` if `family_size` is 0, otherwise `False`
titanic['alone'] = titanic['family_size'] == 0

titanic.head()

### Adding Features - Deck

We can use the `cabin` to determine the deck of the ship where the passenger's cabin was located. The deck information is encoded in the first character of the `cabin` value (e.g., `C85` corresponds to deck `C`). We can extract this information and create a new column called `deck` to store it.

- a `cabin` value of `'None'` will map to deck 0
- cabin `'A'` will map to deck 1, `'B'` to deck 2, and so on up to `'G'` (deck 7)
- cabins `'T'`, `'U'`, `'W'`, `'X'`, `'Y'`, and `'Z'` were on the boat deck, so we will map them to deck 8 (only one record has a cabin value of `'T'`)
- some records may contain multiple cabin values, separated by a space, in which case we will take the first cabin value to derive the deck

In [None]:
# Extract the deck from the `cabin` column
deck_mapping = {'N':0, 'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'T': 8, 'U': 8, 'W': 8, 'X': 8, 'Y': 8, 'Z': 8}

titanic['deck'] = titanic['cabin'].str[0].map(deck_mapping)

titanic.sample(5)
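Two details of this mapping are worth checking on a small example: a multi-cabin value still yields the deck of the first cabin, and any letter missing from the mapping becomes `NaN`. The sketch below uses made-up cabin values (including a hypothetical `'Q12'`) and a shortened mapping.

```python
import pandas as pd

# Hypothetical cabin values; 'Q12' is deliberately not covered by the mapping
cabins = pd.Series(['C85', 'None', 'B57 B59 B63', 'Q12'])

deck_mapping = {'N': 0, 'A': 1, 'B': 2, 'C': 3}   # shortened for this sketch

# `.str[0]` takes the first character; `.map()` looks it up in the dictionary
decks = cabins.str[0].map(deck_mapping)

print(decks.tolist())
```

Running `isna().sum()` on the new column after mapping is a quick way to confirm that every deck letter was covered.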

## Transforming Features

Data transformation involves converting data into a suitable format for analysis. This may include scaling numerical features, encoding categorical variables, or normalising data. Pandas provides functions to help with these transformations.

### Transforming Features - Age

The `age` column contains continuous numerical data that can be standardised, which can improve the performance of models that are sensitive to feature scale. We can use the `StandardScaler` from `scikit-learn` to rescale the `age` column to have a mean of 0 and a standard deviation of 1.

In [None]:
# Standardise the `age` column (mean 0, standard deviation 1)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

titanic['age'] = scaler.fit_transform(titanic[['age']])

titanic.head()
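To make the transformation concrete, the same calculation can be sketched by hand with Pandas alone: subtract the mean, then divide by the standard deviation. The values below are made up, and `ddof=0` selects the population standard deviation, which is what `StandardScaler` uses.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([20.0, 30.0, 40.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
standardised = (ages - ages.mean()) / ages.std(ddof=0)

print(standardised.tolist())
print(standardised.mean(), standardised.std(ddof=0))
```

Note that the Pandas default for `std()` is `ddof=1` (the sample standard deviation), so the argument matters if the results are meant to match `StandardScaler`.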

### Transforming Features - Sex

The `sex` column contains categorical data that can be converted into numerical form using one-hot encoding, but we can also map the values to integers directly.

In [None]:
# Map `sex` column to integer - male -> 0, female -> 1
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1}).astype(int)

titanic.head()

## Removing Unnecessary Columns

After creating new features, we may want to remove unnecessary columns from the dataset to reduce complexity and improve performance. Pandas provides the `drop()` function to remove columns from a `DataFrame`.

In [None]:
# Drop `cabin`, `name`, `parch`, and `sibsp` columns - no longer needed
titanic = titanic.drop(['cabin', 'name', 'parch', 'sibsp'], axis=1)

titanic.sample(5)

In [None]:
# Drop `id` and `ticket` columns - not useful
titanic = titanic.drop(['id', 'ticket'], axis=1)

titanic.sample(5)
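One practical note: re-running a cell that drops columns raises a `KeyError` the second time, because the columns are already gone. The sketch below, on a made-up `toy` DataFrame, shows the `columns=` keyword (equivalent to passing labels with `axis=1`) and the `errors='ignore'` option, which skips labels that are not present.

```python
import pandas as pd

# Hypothetical data
toy = pd.DataFrame({'id': [1, 2], 'ticket': ['A1', 'B2'], 'fare': [7.25, 8.05]})

# `columns=` is equivalent to passing the labels with `axis=1`
trimmed = toy.drop(columns=['id', 'ticket'])

# `errors='ignore'` skips missing labels instead of raising KeyError
trimmed_again = trimmed.drop(columns=['id'], errors='ignore')

print(list(trimmed.columns), list(trimmed_again.columns))
```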

## Writing to a CSV file

Now that we have cleaned and preprocessed the data, we can write it to a CSV file using the `to_csv()` method. It's a good practice to save the cleaned data to a file so that we can reuse it later for analysis or modelling.

In [38]:
# Save the cleaned DataFrame to a new CSV file
titanic.to_csv('titanic_cleaned.csv', index=False)

We can also split the data into features (`X`) and target (`y`) before writing them to files. This allows us to save the model inputs and the `survived` labels separately for machine learning tasks.

In [39]:
# Split the data into features and target
X = titanic.drop('survived', axis=1)
y = titanic['survived']

# Save the features and target to separate CSV files
X.to_csv('titanic_features.csv', index=False)
y.to_csv('titanic_target.csv', index=False)
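The cell above separates columns (features and target), not rows. If a row-wise split into training and testing sets is also needed, one possible approach with Pandas alone is sketched below on made-up data. The 80/20 ratio and the `random_state` value are arbitrary choices for illustration; `scikit-learn` also provides `train_test_split` for this purpose.

```python
import pandas as pd

# Hypothetical data standing in for the cleaned DataFrame
toy = pd.DataFrame({
    'fare': [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0],
    'survived': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Randomly pick 80% of the rows for training; `random_state` makes it repeatable
train = toy.sample(frac=0.8, random_state=42)

# The remaining rows form the test set
test = toy.drop(train.index)

print(len(train), len(test))
```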