# Lab 1 - Data Manipulation with Pandas


<div>
<img src="../../images/pandas_logo.png" width="700"/>
</div>

_(Adapted from [CS109a: Introduction to Data Science](https://harvard-iacs.github.io/2019-CS109A/), [Pandas: Getting Started](https://pandas.pydata.org/docs/getting_started/index.html) & [GitHub: pandas_exercises](https://github.com/guipsamora))_


## Table of Contents

1.  Quick Overview
2.  Learning Goals
3.  Loading and Cleaning with Pandas
4.  Parsing and Completing the Dataframe
5.  Group Exercise: 1/2 hour in the Life of a Cardiologist


# 1. Quick Overview


In [None]:
from pathlib import Path
from typing import List

# Initialize a base path for us to use
BASE_PATH = Path().cwd()

BASE_PATH

## How is a DataFrame structured?

<div>
<img src="../../images/pandas_structure.png" width="700"/>
</div>


Getting started with pandas:


In [None]:
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

df

In [None]:
df["Age"]

When selecting a single column of a pandas **`DataFrame`**, the result is a pandas **`Series`**.


In [None]:
type(df["Age"])

A pandas **`Series`** has no column labels, as it is just a single column of a **`DataFrame`**. A Series does have row labels.


In [None]:
# Select rows by their index (row label): here, the rows with an even label
series = df["Age"]

series.loc[series.index % 2 == 0]

## How do we get data inside a DataFrame?

<div>
<img src="../../images/pandas_read_data.png" width="700"/>
</div>

Pretty simple, just use the (hopefully existing) **`read_<file_extension>`** method:


In [None]:
DATA_PATH = BASE_PATH / "data"

titanic = pd.read_csv(DATA_PATH / "titanic" / "titanic.csv")
titanic

The great thing about this modular approach is that if the file extension maps one-to-one to an existing pandas method, we have nothing to worry about.

_Note: If we were working with something like `xls` or `xlsx` (Microsoft Excel formats), we would need to map the extension to the corresponding method, `read_excel`._
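
The extension-to-method mapping mentioned in the note could be handled with a small lookup table. The following is a minimal sketch under assumptions: the names `EXTENSION_TO_READER` and `resolve_reader` are hypothetical and not part of pandas or this lab.

```python
import pandas as pd

# Hypothetical lookup table: extensions whose reader is not named read_<ext>
EXTENSION_TO_READER = {"xls": "read_excel", "xlsx": "read_excel"}


def resolve_reader(file_extension: str):
    """Return the pandas reader for an extension, or None if there is none."""
    method_name = EXTENSION_TO_READER.get(file_extension, f"read_{file_extension}")
    # getattr takes its fallback value as a positional argument
    return getattr(pd, method_name, None)


excel_reader = resolve_reader("xlsx")
csv_reader = resolve_reader("csv")
unknown_reader = resolve_reader("nope")
```

Extensions that are not in the table fall back to the `read_<ext>` naming convention, so only the exceptions need to be listed.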


In [None]:
def load_data(data_path: Path) -> List[pd.DataFrame]:
    """Loads all readable data files from a given directory into pandas DataFrames.

    Args:
        data_path (Path): Path object representing the base directory
            containing the data files.

    Returns:
        List[pd.DataFrame]: A list of pandas DataFrames, one per successfully
            loaded file.
    """
    files_found = [path for path in data_path.glob("**/*") if path.is_file()]

    result = []
    for found in files_found:
        # Give us the file extension (.<ext>) and then remove the '.' leaving us only with <ext>
        file_extension = found.suffix.lstrip(".")

        read_method = getattr(pd, f"read_{file_extension}", None)
        if callable(read_method):
            result.append(read_method(found))

    return result

In [None]:
data = load_data(DATA_PATH / "titanic")

print(f"Found: '{len(data)}' DataFrames")
data[2]

To check how pandas interpreted the data type of each column, request the **`dtypes`** attribute:


In [None]:
titanic.dtypes

Here, the data type used for each column is listed. The data types in this **`DataFrame`** are integers (**`int64`**), floats (**`float64`**) and strings (**`object`**).

What is the (potential) consequence of **`dtype`** being **`object`** for strings? <br>
$\rightarrow$ Might not be the fastest approach


In [20]:
import pandas as pd

n = 1_000_000
series_obj = pd.Series(["hello"] * n, dtype=object) # Numpy ndarray
series_arrow = series_obj.astype("string[pyarrow]") # pyarrow string

print(series_obj.dtype)
print(series_arrow.dtype)

print("\nBenchmarking .str.upper() ...")

print("object dtype:")
%timeit series_obj.str.upper()

print("string[pyarrow] dtype:")
%timeit series_arrow.str.upper()

object
string

Benchmarking .str.upper() ...
object dtype:
75.3 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
string[pyarrow] dtype:
9.36 ms ± 16.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## How can you work with pandas DataFrames?


The Titanic data set consists of the following data columns:

- **`PassengerId`**: Id of every passenger.

- **`Survived`**: Indication whether passenger survived. 0 for no and 1 for yes.

- **`Pclass`**: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

- **`Name`**: Name of passenger.

- **`Sex`**: Gender of passenger.

- **`Age`**: Age of passenger in years.

- **`SibSp`**: Number of siblings or spouses aboard.

- **`Parch`**: Number of parents or children aboard.

- **`Ticket`**: Ticket number of passenger.

- **`Fare`**: Indicating the fare.

- **`Cabin`**: Cabin number of passenger.

- **`Embarked`**: Port of embarkation.


### 1. Learning Goals

About 6,000 odd "best books" were fetched and parsed from [Goodreads](https://www.goodreads.com/). The "bestness" of these books came from a proprietary formula used by Goodreads and published as a list on their web site.

We parsed the page for each book and saved data from all these pages in a tabular format as a CSV file. In this lab we'll clean and further parse the data. We'll then do some exploratory data analysis to answer questions about these best books and popular genres.

By the end of this lab, you should be able to:

- Load a data set and systematically address its missing values (encoded as `NaN`), for example, by removing observations associated with these values.
- Parse columns in the dataframe to create new dataframe columns.

#### 1.1 Basic workflow

The basic workflow is as follows:

1.  **Build** a DataFrame from the data (ideally, put all data in this object)
2.  **Clean** the DataFrame. It should have the following properties:

- Each row describes a single object
- Each column describes a property of that object
- Columns are numeric whenever appropriate
- Columns contain atomic properties that cannot be further decomposed

3.  Explore **global properties**. Use histograms, scatter plots, and aggregation functions to summarize the data.

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up in subsequent analysis.
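
The three steps above can be sketched on a tiny, made-up table. The records and column names below are assumptions chosen for illustration, not the lab's data.

```python
import pandas as pd

# 1. Build: put the (hypothetical) raw records into one DataFrame
raw = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "year": ["1999", None, "2005"],
        "rating": [4.1, 3.9, 4.5],
    }
)

# 2. Clean: drop rows without a year and make the column numeric
clean = raw.dropna(subset=["year"]).copy()
clean["year"] = pd.to_numeric(clean["year"]).astype(int)

# 3. Explore global properties with aggregation functions
summary = clean["rating"].agg(["mean", "min", "max"])
```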


### 2. Loading and Cleaning with Pandas

Read in the `goodreads.csv` file, examine the data, and do any necessary data cleaning.

Here is a description of the columns (in order) present in this csv file:

- `rating`: The average rating on a 1-5 scale achieved by the book
- `review_count`: The number of Goodreads users who reviewed this book
- `isbn`: The ISBN code for the book
- `booktype`: An internal Goodreads identifier for the book
- `author_url`: The Goodreads (relative) URL for the author of the book
- `year`: The year the book was published
- `genre_urls`: A string with '|' separated relative URLs of Goodreads genre pages
- `dir`: A directory identifier internal to the scraping code
- `rating_count`: The number of ratings for this book (this is different from the number of reviews)
- `name`: The name of the book


Let us see what issues we find with the data and resolve them.


---


First, we load the appropriate libraries:


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

#### 2.1 Cleaning: Reading in the data

We read in and clean the data from `data/goodreads.csv`.


In [None]:
# Read the data into a dataframe
df = pd.read_csv("data/goodreads.csv", encoding="utf-8")

# Examine a few rows of the dataframe
df

Oh dear. What are we looking at?

That does not seem quite right. We are missing the column names: by default, `read_csv` treats the first row of the file as the header. We need to add the names in! But what are they?

Here is a list of them in order:

- `rating`, `review_count`, `isbn`, `booktype`, `author_url`, `year`, `genre_urls`, `dir`, `rating_count`, `name`


#### Exercise: Load the Goodreads CSV file from `data/goodreads.csv`

Use these column names to load the dataframe properly, and then "head" the dataframe. (Tip: look at the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) documentation.)
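
As a hint, here is a minimal sketch of how `read_csv` can be told that a file has no header row. It uses a made-up, in-memory CSV with only three columns rather than the Goodreads file, so the data and the shortened column list are assumptions.

```python
import io

import pandas as pd

# Hypothetical headerless CSV content, held in memory for illustration
csv_text = "4.5,100,Dune\n3.9,42,Emma\n"

# header=None: the first row is data; names=...: supply the column labels
toy = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["rating", "review_count", "name"],
)
toy.head()
```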


In [None]:
# your code here

#### 2.2 Cleaning: Examining the dataframe - Quick checks

We should examine the dataframe to get an overall sense of the content.

**Exercise**
Let's check the types of the columns. What do you find?


In [None]:
df.dtypes

Notice that `review_count` and `rating_count` are objects instead of ints, and the year is a float!

There are a couple more quick sanity checks to perform on the dataframe.


In [None]:
print(df.shape)
df.columns

#### 2.3 Cleaning: Examining the dataframe - A deeper look

Beyond checking some quick general properties of the data frame and looking at the first
rows, we can dig a bit deeper into the values being stored. If you haven't already, check to see if there are any missing values in the data frame.

Let's start with a column which seemed OK to us.


In [None]:
# Get a sense of how many missing values there are in the dataframe.
for col in df.columns:
    print(f"{col}: {df[col].isnull().sum()}")

In [None]:
# Alternative way
df[df.rating.isnull()]

How does `pandas` or `numpy` handle missing values when we try to compute with data sets that include them?
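
A short sketch of the difference, using a made-up array: plain `numpy` reductions propagate `NaN`, while `pandas` reductions skip missing values by default.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 3.0])

numpy_mean = np.mean(values)            # NaN propagates through the computation
numpy_nanmean = np.nanmean(values)      # NaN-aware variant ignores missing entries
pandas_mean = pd.Series(values).mean()  # pandas skips NaN by default (skipna=True)
```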

We'll now check if any of the other suspicious columns have missing values. Let's look at `year` and `review_count` first.

One thing you can do is to try and convert to the type you expect the column to be. If something goes wrong, it likely means your data are bad.
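
The idea of using a failed conversion as a warning sign can be sketched as follows. The series below is made up; the string `"None"` stands in for whatever invalid marker a real column might contain.

```python
import pandas as pd

# Hypothetical column that should hold integers but contains a bad entry
counts = pd.Series(["10", "25", "None"])

try:
    counts.astype(int)
    conversion_ok = True
except ValueError:
    # The conversion fails, which tells us the column holds non-numeric data
    conversion_ok = False

# errors="coerce" turns unparseable entries into NaN so they can be located
coerced = pd.to_numeric(counts, errors="coerce")
bad_rows = counts[coerced.isnull()]
```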

Let's test for missing data:


In [None]:
df[df.year.isnull()]

df.year.isnull()
df.shape

#### 2.4 Cleaning: Dealing with Missing Values

How should we interpret 'missing' or 'invalid' values in the data (hint: look at where these values occur)? One approach is to simply exclude them from the dataframe. Is this appropriate for all 'missing' or 'invalid' values?
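
Excluding rows is not the only option. As a sketch on made-up data (the values are assumptions), a missing numeric field may justify dropping the row, while a missing text field can often be replaced by an empty string:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "year": [1999.0, np.nan, 2005.0],
        "isbn": ["123", None, None],
    }
)

# Option 1: drop the rows where a required column is missing
dropped = toy.dropna(subset=["year"])

# Option 2: keep every row and fill the missing text values instead
filled = toy.fillna({"isbn": ""})
```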


In [None]:
# Treat the missing or invalid values in your dataframe
#######

df = df[df.year.notnull()]

OK, so we have done some cleaning. What do things look like now? Notice that `year` is still a float.


In [None]:
df.dtypes

In [None]:
print(df.year.isnull().sum())
df.shape  # We removed seven rows

#### Exercise: Converting types

OK, so let's adjust those types and convert them to `int`. If the type conversion fails, we know we have further problems.


In [None]:
# Your Code Here
# rating is an average on a 1-5 scale and stays a float; the counts become ints
df.rating_count = df.rating_count.astype(int)
df.review_count = df.review_count.astype(int)
df.year = df.year.astype(int)

Once you do this, we seem to be good on these columns (no errors in conversion). Let's look:


In [None]:
df.dtypes

Great, now let's handle the other columns that contain `NaN` but should be strings.


In [None]:
df.loc[df.genre_urls.isnull(), "genre_urls"] = ""
df.loc[df.isbn.isnull(), "isbn"] = ""

### 3. Parsing and Completing the Data Frame

We will parse the `author` column from the author_url and `genres` column from the genre_urls.


Examine an example `author_url` and reason about which sequence of string operations must be performed in order to isolate the author's name.


In [None]:
# Get an example author_url (the row with index label 1)
author_string = df.author_url[1]
author_string

#### Exercise: Extracting all Authors


In [None]:
# Test out some string operations to isolate the author name

Let's wrap the above code into a function, which we will then use.


#### Exercise: Parsing out the genres


In [None]:
df.genre_urls.head()

Adjust the genres to be of type `<genre_1>|<genre_2>|...|<genre_n>`, e.g. `young-adult|science-fiction|fantasy`


In [None]:
# Your code here

Use map again to create a new "genres" column
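
As a reminder of how `Series.map` applies a function to every element, here is a small sketch. The helper `count_parts` and the toy `tags` column are hypothetical and deliberately not the solution to the exercise.

```python
import pandas as pd


def count_parts(text: str) -> int:
    """Hypothetical helper: count '|'-separated parts, 0 for an empty string."""
    return len(text.split("|")) if text else 0


toy = pd.DataFrame({"tags": ["a|b|c", "", "x"]})

# map calls count_parts once per element and returns a new Series
toy["n_tags"] = toy["tags"].map(count_parts)
```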


### The Titanic dataset

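The `titanic.csv` file contains data for 887 passengers on the Titanic. Each row represents one person. The columns describe different attributes about the person including whether they survived, their age, their on-board class, their sex, and the fare they paid. Note that the cell below loads the copy of the dataset bundled with seaborn via `sns.load_dataset`; its column names are lowercase and its row count may differ slightly, so check the output of `titanic.info()`.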


In [None]:
titanic = sns.load_dataset("titanic")
titanic.info()

In [None]:
titanic.columns

#### Exercise: Remove the following features:

`'embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone'`


In [None]:
# Your code here

#### Exercise: Find out for how many passengers we do not have deck information.


In [None]:
# Your code here

### Histograms

#### Plotting one variable's distribution (categorical and continuous)

The most convenient way to take a quick look at a univariate distribution in `seaborn` is the `histplot()` function, which replaces the deprecated `distplot()`. By default, this draws a histogram; passing `kde=True` additionally overlays a kernel density estimate (KDE).

A histogram displays a quantitative (numerical) distribution by showing the number (or percentage) of the data values that fall in specified intervals. The intervals are on the x-axis and the number of values falling in each interval, shown as either a number or percentage, are represented by bars drawn above the corresponding intervals.
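
The counting that a histogram performs can be reproduced numerically. This sketch uses a handful of made-up ages and three equal-width intervals; the values and the bin choice are assumptions for illustration.

```python
import numpy as np

ages = np.array([2, 5, 21, 22, 35, 36, 58])

# Three intervals over [0, 60]: [0, 20), [20, 40), [40, 60]
counts, edges = np.histogram(ages, bins=3, range=(0, 60))

# Each count is the height of the bar drawn above its interval
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {count}")
```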


In [None]:
# What was the age distribution among passengers in the Titanic?
sns.set(color_codes=True)

f, ax = plt.subplots(1, 1, figsize=(8, 3))
ax = sns.histplot(titanic.age, kde=False, bins=20)


ax.set(xlim=(0, 90))
ax.set_ylabel("counts")

#### Exercise (pandas trick): Count all the infants on board (age less than 3) and all the children ages 5-10.


In [None]:
# Your Code here

### `pandas` Tricks

The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they are immutable)
and returns a new object. Note that it is seldom necessary to copy objects. For example, there are only a
handful of ways to alter a DataFrame in-place:

- Inserting, deleting, or modifying a column.
- Assigning to the index or columns attributes.
- For homogeneous data, directly modifying the values via the values attribute or advanced indexing.

To be clear, by default pandas methods do not modify your data as a side effect; almost every method returns a new object,
leaving the original object untouched (the main exception is explicitly passing `inplace=True`). If the data is modified, it is because you did so explicitly.
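
A small sketch of these points on a made-up frame: a method call returns a new object, a copy is independent of later changes, and only the explicit column assignment alters the original.

```python
import pandas as pd

original = pd.DataFrame({"a": [3, 1, 2]})

# sort_values returns a new DataFrame; original keeps its row order
sorted_df = original.sort_values("a")

# copy() gives an independent object
backup = original.copy()

# Inserting a column is one of the explicit in-place modifications
original["b"] = original["a"] * 2
```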


### 4. Group Exercise: 1/2 hour in the Life of a Cardiologist

Try each exercise on your own and then discuss with your peers sitting at your table.

Visualize and explore the data. Use `.describe()` to look at your data and also examine if you have any missing values.
What is the actual number of feature variables after converting categorical variables to dummy ones?
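
Converting a categorical variable to dummy variables can be sketched with `pd.get_dummies`. The two-row frame below is made up; only the column names `age` and `cp` are borrowed from the variable list.

```python
import pandas as pd

toy = pd.DataFrame({"age": [63, 41], "cp": [1, 4]})

# Each distinct value of cp becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["cp"])
encoded_columns = list(encoded.columns)
```

Only the categories that actually occur in the data produce a column, so the resulting number of features depends on the values present.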

**List of available variables (includes target variable `num`):**

- **age**: continuous
- **sex**: categorical, 2 values {0: female, 1: male}
- **cp** (chest pain type): categorical, 4 values
  {1: typical angina, 2: atypical angina, 3: non-angina, 4: asymptomatic angina}
- **restbp** (resting blood pressure on admission to hospital): continuous (mmHg)
- **chol** (serum cholesterol level): continuous (mg/dl)
- **fbs** (fasting blood sugar): categorical, 2 values {0: <= 120 mg/dl, 1: > 120 mg/dl}
- **restecg** (resting electrocardiography): categorical, 3 values
  {0: normal, 1: ST-T wave abnormality, 2: left ventricular hypertrophy}
- **thalach** (maximum heart rate achieved): continuous
- **exang** (exercise induced angina): categorical, 2 values {0: no, 1: yes}
- **oldpeak** (ST depression induced by exercise relative to rest): continuous
- **slope** (slope of peak exercise ST segment): categorical, 3 values {1: upsloping, 2: flat, 3: downsloping}
- **ca** (number of major vessels colored by fluoroscopy): discrete (0,1,2,3)
- **thal**: categorical, 3 values {3: normal, 6: fixed defect, 7: reversible defect}
- **num** (diagnosis of heart disease): categorical, 5 values {
  0: less than 50% narrowing in any major vessel,
  1-4: more than 50% narrowing in 1-4 vessels
  }


In [None]:
# load the dataset
columns_heart = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

heart_df = pd.read_csv("data/heart_disease.csv", header=None, names=columns_heart)
heart_df.head()

## A practical Introduction to seaborn

<div>
<a href="https://seaborn.pydata.org/examples/index.html">
<img src="images/seaborn.png" width="700"/>
</a>
</div>


### Answer the following questions using plots

1.  At what ages do people seek cardiological exams?
2.  Do men seek help more than women?
3.  Examine the variables. How do they relate to one another?
4.  (Variation on question 2): What % of men and women seek cardio exams?
5.  Does resting blood pressure increase with age?


**Pandas trick: `.replace`** The response variable (num) is categorical with 5 values, but we don't have enough data to predict all the categories. <BR> Therefore we'll replace `num` with `hd` (heart disease): **categorical, 2 values {0: no, 1: yes}**. <BR>
Use the code below (take a minute to understand how it works, it's very useful!):


In [None]:
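# Replace response variable values with a binary response (1: heart disease (hd) or 0: not).
# Assign the result back rather than using inplace=True on a selected column,
# which newer pandas versions warn about and may not propagate to heart_df.
heart_df["num"] = heart_df["num"].replace(to_replace=[1, 2, 3, 4], value=1)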

# Rename column for clarity
heart_df = heart_df.rename(columns={"num": "hd"})
heart_df.head()

In [None]:
# look at the features
heart_df.info()

In [None]:
heart_df.describe()

#### Exercise: At which ages do people seek cardiological exams?


In [None]:
# Your code here

#### Exercise: Which of the listed genders seeks more help than the other?


#### Exercise: What percentage of men and women seek cardio exams?


#### Exercise: Examine the variables. How do they relate to one another?


#### Exercise: Does resting blood pressure increase with age?
