# Appendix 3: Importing and Modifying Data for Biostatistics

<div class="alert alert-info">Learning goals</div>

1. Import data of different formats into Python using Pandas.
2. Identify and handle missing values in a dataset.
3. Transform and reshape data to suit the needs of your analysis.
4. Understand how to merge multiple datasets into a single, coherent dataset.

## Introduction

Working with data is central to biostatistics, but that data is rarely collected with Python in mind, so it usually needs some preparation before it can be analyzed. This appendix provides an overview of importing and modifying data in Python from a biostatistics perspective. We will discuss common methods for importing data, techniques for handling missing values, ways to transform and reshape your data to suit your analysis, and how to merge datasets.

Before we begin, ensure you have the following essential Python libraries installed:

- Pandas: Provides data structures and data analysis tools.
- NumPy: Allows numerical operations on arrays and matrices.
- Matplotlib: Facilitates the creation of static, animated, and interactive visualizations in Python.

You can install these packages using pip:

In [17]:
%%script false --no-raise-error

pip install pandas numpy matplotlib
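
After installing, it can help to confirm that the packages are visible to the Python interpreter the notebook is using. The following is a minimal sketch using only the standard library; the variable names `required` and `availability` are illustrative and not part of any package.

```python
import importlib.util

# Packages this appendix relies on (import names, which here match the pip names).
required = ("pandas", "numpy", "matplotlib")

# find_spec returns None when a top-level package cannot be located,
# so this check does not fail if something is missing.
availability = {name: importlib.util.find_spec(name) is not None for name in required}

for name, found in availability.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If a package is reported as missing even though pip succeeded, pip may have installed it into a different environment than the one running the notebook.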

### Importing Data

Data in biostatistics can come in various formats such as CSV (Comma-Separated Values), Excel (xls, xlsx), or TSV (Tab-Separated Values). We'll primarily use the Pandas library to read these files into Python:

In [18]:
%%script false --no-raise-error

import pandas as pd

# Loading a CSV file
df_csv = pd.read_csv('filename.csv')

# Loading an Excel file (reading .xlsx files requires the openpyxl package)
df_excel = pd.read_excel('filename.xlsx')

# Loading a TSV file
df_tsv = pd.read_csv('filename.tsv', sep='\t')
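
The file names above are placeholders, so those calls cannot run without real files. As a self-contained sketch, the example below reads tab-separated text from memory instead of from disk. The column names, the values, and the use of `-99` as a missing-value code are invented for illustration; `sep`, `na_values`, and `dtype` are standard `read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk.
tsv_text = (
    "patient_id\tage\tsystolic_bp\n"
    "P001\t34\t118\n"
    "P002\tNA\t131\n"
    "P003\t58\t-99\n"
)

# sep="\t" splits on tabs. "NA" is treated as missing by default;
# na_values adds "-99" as an extra missing-value code (an assumption here).
# dtype keeps the identifier column as text rather than letting pandas guess.
df_demo = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    na_values=["-99"],
    dtype={"patient_id": str},
)

print(df_demo)
print(df_demo.dtypes)
```

Declaring missing-value codes at import time keeps sentinel values such as `-99` from being averaged into later summaries as if they were real measurements.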

### Dealing with Missing Values

Real-world datasets often have missing values, which can lead to misleading results if they are ignored or handled carelessly. Let's look at some techniques for handling them:

In [19]:
%%script false --no-raise-error

import pandas as pd

# Check for missing values
print(df_csv.isnull().sum())

# Drop rows with missing values
df_no_missing = df_csv.dropna()

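# Fill missing values in numeric columns with the column mean
# (a median, mode, or constant can be passed to fillna in the same way).
# numeric_only=True is needed when the DataFrame also has non-numeric columns.
df_fill_mean = df_csv.fillna(df_csv.mean(numeric_only=True))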
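
To make these options concrete, here is a small sketch on an invented DataFrame (the column names and values are hypothetical). It counts missing values, drops rows only where one key column is missing, and fills numeric and categorical columns differently. Median and mode are used here purely as examples; which fill strategy is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both numeric and categorical columns.
df_demo = pd.DataFrame({
    "age": [34, np.nan, 58, 41],
    "sex": ["F", "M", None, "F"],
    "crp_mg_l": [2.1, 8.4, np.nan, np.nan],
})

# Number of missing values in each column.
missing_counts = df_demo.isnull().sum()
print(missing_counts)

# Drop a row only when a specific column is missing, rather than any column.
df_age_known = df_demo.dropna(subset=["age"])

# Fill numeric columns with their median and the categorical column with its mode.
filled = df_demo.copy()
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
filled["sex"] = filled["sex"].fillna(filled["sex"].mode().iloc[0])
print(filled)
```

Calling `dropna()` with no arguments would discard three of the four rows here, which shows how quickly row-wise deletion can shrink a dataset.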

### Data Transformation and Reshaping

Often, data may not be in a form that is ready for analysis, and you may need to reshape or transform it:

In [4]:
%%script false --no-raise-error

import pandas as pd

# Renaming columns
df_csv = df_csv.rename(columns={'OldName1': 'NewName1', 'OldName2': 'NewName2'})

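# Converting a continuous variable into a categorical variable.
# Intervals are right-closed by default: (0, 10], (10, 20], and so on.
# include_lowest=True also keeps an age of exactly 0; ages above 40 become NaN.
bins = [0, 10, 20, 30, 40]
labels = ['0-10', '10-20', '20-30', '30-40']
df_csv['AgeGroup'] = pd.cut(df_csv['Age'], bins=bins, labels=labels, include_lowest=True)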

# Pivoting data (reshaping long to wide); each index/column pair must be unique
df_pivot = df_csv.pivot(index='ColumnNameToIndex', columns='ColumnNameToColumns', values='ColumnNameToValues')

# Merging two DataFrames (df_csv1 and df_csv2 are placeholders for DataFrames loaded earlier)
df_merge = pd.merge(df_csv1, df_csv2, on='CommonColumnName', how='inner')  # how can also be 'outer', 'left', or 'right'
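
The snippets above use placeholder column names, so the following sketch runs the same three operations (binning, pivoting, and merging) on two small invented tables. All names and values are hypothetical. The `indicator=True` argument to `merge` is an optional extra that records which table each row came from.

```python
import pandas as pd

# Hypothetical long-format measurements: one row per patient per visit.
visits = pd.DataFrame({
    "patient_id": ["P001", "P001", "P002", "P002"],
    "visit": ["baseline", "week4", "baseline", "week4"],
    "weight_kg": [70.2, 69.1, 85.0, 83.4],
})

# Hypothetical patient-level table; P003 has no visits recorded.
patients = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "age": [8, 35, 52],
})

# Binning: intervals are right-closed, and ages outside the bins (here 52) become NaN.
patients["age_group"] = pd.cut(
    patients["age"],
    bins=[0, 10, 20, 30, 40],
    labels=["0-10", "10-20", "20-30", "30-40"],
)

# Pivoting: long to wide, one row per patient and one column per visit.
wide = visits.pivot(index="patient_id", columns="visit", values="weight_kg")

# Merging: a left join keeps every patient; _merge flags rows without a match.
merged = patients.merge(wide.reset_index(), on="patient_id", how="left", indicator=True)
print(merged)
```

Inspecting the `_merge` column after a join is a quick way to spot records that silently failed to match, such as the patient with no visits here.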

## Conclusion

The importance of being able to import, clean, and modify data cannot be overstated in biostatistics. Researchers and data scientists should familiarize themselves with these techniques so that their data is prepared correctly for downstream statistical analysis.