# Module 3: Input and Output

Up until this point, we have dealt only with data that we created "on the fly" within our Python programs. We now turn to the all-important concept of ingesting data into a Python program from persistent sources, manipulating it, and then creating output. We will primarily be using "flat files" such as `.txt` and `.csv` files to begin our exploration of the built-in I/O capabilities of Python. Then we introduce one of the most-used Python packages for data analysis: `pandas`. The `pandas` package provides fast, flexible, and expressive data structures that make working with tabular data easy and intuitive.

## Learning Objectives

* Effectively open, read, and manipulate data from various file types using standard Python libraries
* Use the `pandas` package for data input and manipulation
* Export data in various file types


## Introduction to I/O

We first introduce the input/output (I/O) process along with **files** and their different types. Next, using Python's built-in I/O capabilities, we will read, manipulate, and write text files. We also explore manipulating `.csv`, HTML, and JSON files.
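As a rough preview of the built-in tools involved, the sketch below writes and then reads a text file, a `.csv` file, and a JSON file using only the standard library. The file names and sample records are made up for illustration, and the notebook may approach these tasks differently.

```python
import csv
import json
import tempfile
from pathlib import Path

# Work in a temporary directory so the sketch leaves your own files untouched.
workdir = Path(tempfile.mkdtemp())

# Write and then read a plain text file (hypothetical file name).
text_path = workdir / "notes.txt"
with open(text_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(text_path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# Write and then read a .csv file with the csv module.
# Note that csv reads every field back as a string.
rows = [{"name": "Ada", "score": "90"}, {"name": "Grace", "score": "85"}]
csv_path = workdir / "scores.csv"
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "score"])
    writer.writeheader()
    writer.writerows(rows)

with open(csv_path, newline="", encoding="utf-8") as f:
    loaded_rows = list(csv.DictReader(f))

# Round-trip the same records through a JSON file.
json_path = workdir / "scores.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)

with open(json_path, encoding="utf-8") as f:
    loaded_json = json.load(f)
```

The `with` statement closes each file automatically, even if an error occurs partway through the block.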

### Jupyter Notebook

We will be using the [input/output][io] notebook.


### Objectives

By the end of this section, you will understand:

- What a file is and how it relates to the directory structure on your computer
- Various file types and their differences
- What a `.csv` file is and what `csv` stands for

### Additional Resources

The following links point you to additional resources that you might find helpful in learning this material.

1. The official API reference for [`io`][1].
2. The [tutorial for reading and writing files][3].
3. Introducing [JSON][4].

-----

[1]: https://docs.python.org/3/library/io.html
[3]: https://docs.python.org/3/tutorial/inputoutput.html#tut-files
[4]: https://www.json.org/json-en.html

[io]: input_output.ipynb

-----

## Introduction to `pandas`

In this section, we introduce one of the most powerful and useful packages available: `pandas`. It is the de facto standard for data manipulation and analysis in Python.

### Jupyter Notebook

We will be using the [introduction to `pandas`][pandas] notebook.

### Objectives

By the end of this section, you will understand how to:

- Create and use `Series` data structures
- Create and use `DataFrame` data structures
- Use various operations on `Series` and `DataFrame` objects to accomplish data manipulation tasks
- Read and write `.csv` files
- Read and write `.xlsx` files
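
The objectives above can be previewed with a minimal sketch, assuming `pandas` is installed. The item names and numbers are invented for illustration, and the CSV round trip uses an in-memory buffer instead of a file on disk. The `.xlsx` case works along the same lines with `DataFrame.to_excel` and `pandas.read_excel`, which need an additional Excel engine such as `openpyxl`, so it is left out here.

```python
import io

import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series(
    [1.50, 0.25, 3.00],
    index=["apple", "banana", "cherry"],
    name="price",
)

# A DataFrame is a two-dimensional table whose columns are Series.
df = pd.DataFrame(
    {
        "item": ["apple", "banana", "cherry"],
        "price": [1.50, 0.25, 3.00],
        "quantity": [4, 12, 2],
    }
)

# Operations are vectorized: this multiplies the two columns row by row.
df["total"] = df["price"] * df["quantity"]

# Write the table as CSV text, then read it back into a new DataFrame.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
```

Passing `index=False` keeps the row labels out of the CSV text, so the table read back has the same four columns as the original.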


### Additional Resources

The following links point you to additional resources that you might find helpful in learning this material.

1. The official API reference for [`pandas.Series`][pd-series].
2. The official API reference for [`pandas.DataFrame`][pd-dataframe].
3. The [user guide][pd-guide] for `pandas`.

-----

[pd-series]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
[pd-dataframe]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
[pd-guide]: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

[pandas]: intro_to_pandas.ipynb

## Understanding Data Wrangling

We begin our exploration of data wrangling by defining it and briefly discussing several of the common tasks for the process. We also explicitly define **tidy data** and look at a few small examples.
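
To make the idea concrete, here is a small sketch that assumes the usual definition of tidy data from Wickham (2014): each variable is a column and each observation is a row. The country labels and case counts are invented, and `melt` is just one of several ways to reshape a table.

```python
import pandas as pd

# Messy (wide) layout: the variable "year" is hidden in the column headers.
messy = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2020": [10, 20],
        "2021": [11, 21],
    }
)

# Tidy (long) layout: one row per country-year observation.
tidy = messy.melt(id_vars="country", var_name="year", value_name="cases")

# The former column headers are strings, so convert them to integers.
tidy["year"] = tidy["year"].astype(int)
```

In the tidy version, `country`, `year`, and `cases` are each a single column, which makes filtering and grouping by year straightforward.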

### Jupyter Notebook

We will be using the [introduction to data wrangling][data_wrang] notebook.


### Objectives

By the end of this section, you will understand:

- What data wrangling is
- The difference between tidy and messy data
- The concepts of data cleaning, data transformation, and data enrichment


### Additional Resources

The following links point you to additional resources that you might find helpful in learning this material.

1. [Wickham, H. (2014). Tidy Data. *Journal of Statistical Software*, 59(10), 1-23][tidy].
2. [The `pandas` User Guide][pd-user-guide].


-----

[tidy]: https://www.jstatsoft.org/article/view/v059i10
[pd-user-guide]: https://pandas.pydata.org/docs/user_guide/index.html


[data_wrang]: intro_to_data_wrangling.ipynb

## Aggregating Data

There are numerous ways to aggregate your data. You may simply want to enrich the data by creating new columns based on other columns, or you may need to combine disparate datasets into a single `DataFrame` to make your analysis easier. Sometimes, your goal is to summarize your data, create pivot tables, or create cross tabulations. We discuss these various tasks in this section.
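
As a hedged illustration of the first two tasks, the sketch below derives a new column from existing ones and then combines two small tables with a left join. The table names, column names, and values are all hypothetical.

```python
import pandas as pd

# Hypothetical order records.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "customer_id": [10, 20, 10],
        "price": [5.0, 2.0, 4.0],
        "quantity": [2, 3, 1],
    }
)

# Hypothetical customer lookup table.
customers = pd.DataFrame(
    {
        "customer_id": [10, 20],
        "region": ["East", "West"],
    }
)

# Enrich: create a new column based on other columns.
orders["revenue"] = orders["price"] * orders["quantity"]

# Combine: a left join keeps every order and attaches the matching region.
combined = orders.merge(customers, on="customer_id", how="left")
```

A left join is used here so that no order is dropped if a customer is missing from the lookup table; other join types (`inner`, `right`, `outer`) make different trade-offs.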

### Jupyter Notebook

We will be using the [aggregating data][agg] notebook.


### Objectives

By the end of this section, you will understand:

- Various ways to query a `DataFrame`
- The most common types of merging or joining `DataFrame`s
- Binning continuous values into discrete intervals
- Applying functions to both a `Series` and a `DataFrame`
- Techniques for summarizing a `DataFrame`
- How to aggregate by group
- How to create pivot tables and cross tabulations
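
The sketch below touches several of these objectives on one invented sales table. The column names, bin edges, and labels are assumptions chosen for illustration, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West", "West"],
        "product": ["A", "B", "A", "A", "B"],
        "revenue": [100, 150, 200, 50, 300],
    }
)

# Query: keep only the rows that satisfy a condition.
large_sales = sales.query("revenue >= 150")

# Binning: assign each revenue to a labeled interval.
# Intervals are closed on the right: (0, 100], (100, 200], (200, 400].
sales["size"] = pd.cut(
    sales["revenue"],
    bins=[0, 100, 200, 400],
    labels=["small", "medium", "large"],
)

# Aggregate by group: total revenue per region.
by_region = sales.groupby("region")["revenue"].sum()

# Pivot table: total revenue for each region/product pair.
pivot = sales.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)

# Cross tabulation: number of sales for each region/product pair.
counts = pd.crosstab(sales["region"], sales["product"])
```

The pivot table and the cross tabulation share the same row and column layout; they differ in what fills the cells (summed revenue versus row counts).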


### Additional Resources

The following links point you to additional resources that you might find helpful in learning this material.

1. [The official API reference for `pandas.DataFrame.query`][pd-query].
2. [The official API reference for `pandas.Series.pct_change`][pd-pct-change].
3. [The official API reference for `pandas.cut`][pd-cut].
4. [The official API reference for `pandas.qcut`][pd-qcut].
5. [The official API reference for `pandas.Grouper`][pd-grouper].

-----

[pd-query]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
[pd-pct-change]: https://pandas.pydata.org/docs/reference/api/pandas.Series.pct_change.html
[pd-cut]: https://pandas.pydata.org/docs/reference/api/pandas.cut.html
[pd-qcut]: https://pandas.pydata.org/docs/reference/api/pandas.qcut.html
[pd-grouper]: https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html

[agg]: aggregating_data.ipynb

-----

**&copy; 2022 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**