# Data handling

In this notebook, we will work with the following:

- Reading data with `pandas`.
- Cleaning and transforming data.
- Viewing and selecting data.
- Merging and querying.
- Exporting.

In [None]:
import pandas as pd

In [None]:
pd.set_option("mode.copy_on_write", True)

# Reading data

`pandas` supports a number of formats that we often find ourselves using.
For example, I often use data in the Stata `dta` and SAS `sas7bdat` formats.
In particular, if you find yourself pulling full datasets from WRDS (especially the ones that are not accessible through web forms), you will end up using the SAS format.

`pandas` also handles formats like Excel `xlsx`, comma separated values `csv` (and, indeed, nearly any delimited file), and fixed width data.
The acquisition database, SDC Platinum, has a somewhat unreliable Excel output feature, and the `pandas` fixed width format reader takes nearly all of the pain out of reading in data exported that way.
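
As a rough sketch of what reading fixed width data looks like, here is `pd.read_fwf` applied to a small in-memory string. The layout, column names, and values are made up for illustration (this is not real SDC Platinum output), so treat the `colspecs` as assumptions you would adjust to your own export.

```python
import io

import pandas as pd

# Hypothetical fixed-width export: 10 characters for the name,
# 4 for the year, and 8 for a right-aligned employee count.
raw = (
    "Microsoft 2018  131000\n"
    "Google    2018   98771\n"
)

# Each (start, end) pair is a half-open character range for one column.
colspecs = [(0, 10), (10, 14), (14, 22)]

deals = pd.read_fwf(
    io.StringIO(raw),
    colspecs=colspecs,
    names=["name", "year", "employees"],  # illustrative column names
)
print(deals)
print(deals.dtypes)
```

With a real file, you would pass the path instead of the `io.StringIO` object.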

Note: `pandas` can also write many of the formats that it can read.
A notable exception is `sas7bdat` because it is proprietary and undocumented.
The reader was written with some clever reverse engineering, but writing a valid file is difficult and probably not coming in the future (see [GitHub issue](https://github.com/pandas-dev/pandas/issues/13031)).
An easy workaround is using the SAS open format `xpt` or `csv`.

In [None]:
# Stata data
firmyear = pd.read_stata("../data/firmyear.dta")
firmyear.head()

# Cleaning data

You are likely familiar with a number of data cleaning issues.
However, you may not yet know how to map what you know from another program onto Python.
The pandas documentation has a number of comparison references, including [R](https://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html), [Stata](https://pandas.pydata.org/pandas-docs/stable/comparison_with_stata.html) and [SAS](https://pandas.pydata.org/pandas-docs/stable/comparison_with_sas.html).

Some brief examples are below.

## Data types

In [None]:
firmyear.dtypes

Note that all of the columns above are of type `object`, which often means that they are strings.
We want to change the things that we know are numbers (i.e. `count_of_employees` and `year`) into the appropriate types (both `int` in this case).

In [None]:
firmyear["year"] = firmyear["year"].astype("int")
firmyear["count_of_employees"] = firmyear["count_of_employees"].astype("int")

# Note, a more general version would be:
# cols = firmyear.columns.drop('name')
# firmyear[cols] = firmyear[cols].apply(pd.to_numeric, errors='coerce')

In [None]:
firmyear

In [None]:
firmyear.dtypes
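
To see what the commented-out general version above does, here is a small sketch on a toy dataframe (the data are invented, not the contents of `firmyear.dta`). With `errors="coerce"`, values that cannot be parsed become `NaN` instead of raising an exception, which also means the affected column ends up as a float.

```python
import pandas as pd

# Toy data that mimics numbers stored as strings, with one bad value.
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "year": ["2017", "2018", "2019"],
        "count_of_employees": ["10", "n/a", "30"],
    }
)

# Convert every column except "name" to a numeric type.
cols = toy.columns.drop("name")
toy[cols] = toy[cols].apply(pd.to_numeric, errors="coerce")

print(toy)
print(toy.dtypes)
```

In contrast, `astype("int")` would raise an exception on the `"n/a"` value, which is often what you want when you expect the data to be clean.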

## Renaming columns

We could rename columns by creating a new column with the correct name and dropping the prior one, but the `rename` method is more efficient and easily extended to the multiple column case.

In [None]:
# An example of using dictionaries.
COLUMNS = {"count_of_employees": "size_emp"}

firmyear = firmyear.rename(columns=COLUMNS)

In [None]:
firmyear
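
The multiple column case looks the same, just with more entries in the dictionary. This is a sketch on a toy dataframe; the `yr` column is a hypothetical name added only for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["A", "A"], "yr": [2017, 2018], "count_of_employees": [10, 20]}
)

# Keys are the old names, values are the new names.
# Columns that are not mentioned (here, "name") are left alone.
COLUMNS = {"count_of_employees": "size_emp", "yr": "year"}

renamed = toy.rename(columns=COLUMNS)
print(renamed.columns.tolist())
```

Note that `rename` returns a new dataframe, which is why we assign the result to a name.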

## Transformations

We can also do transformations that apply some sort of function or method to data by groups.
This is a fairly simple example, but `pandas` makes it fairly easy to do sophisticated transformations.
See the [split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) documentation.
This is a big topic, and, like before, we are only scratching the surface.

In [None]:
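# We can do per-group things like calculating differences.
# Note that diff() compares each row with the previous row in its group,
# so this assumes the rows are already sorted by year within each firm.
firmyear["size_emp_change"] = firmyear.groupby("name")["size_emp"].diff()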

In [None]:
firmyear
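
Here is a slightly larger sketch of the same idea on toy data (the numbers are invented). It sorts before taking differences, and it also uses `transform`, which broadcasts a per-group statistic back onto every row of the group.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["A", "A", "A", "B", "B"],
        "year": [2016, 2017, 2018, 2017, 2018],
        "size_emp": [10, 12, 15, 100, 90],
    }
)

# Sort so that "previous row" means "previous year" within each firm.
toy = toy.sort_values(["name", "year"])

# Year-over-year change; the first year of each firm has no prior row, so it is NaN.
toy["size_emp_change"] = toy.groupby("name")["size_emp"].diff()

# The firm-level mean, repeated on each of the firm's rows.
toy["size_emp_mean"] = toy.groupby("name")["size_emp"].transform("mean")

print(toy)
```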

# Viewing and selecting data

pandas has a number of tools for viewing and selecting data.
The one we see above is the `df.head()` method that displays the first five rows at the top (or head) of the data.

In [None]:
firmyear.head()

In [None]:
# We can give it a parameter to modify the number of rows.
# Here, we only have six rows, so that's all we get.
firmyear.head(8)

In [None]:
# The len() function works on dataframes.
len(firmyear)

We can also select one or more columns by using indexing that is somewhat like what we did with dictionaries earlier.
However, we can give the indexer a list, and get the named columns.

Note that, when we ask for one column, pandas gives us a series, not a dataframe, so the display is a little less fancy.

In [None]:
firmyear["name"]

In [None]:
# Note the two sets of brackets.
# The outer set is for the indexing syntax.
# The inner set is for the list that we're asking the indexer for.
firmyear[["name", "year"]]

We can also ask for rows that meet certain conditions.

In [None]:
firmyear[firmyear["name"] == "Microsoft"]

In [None]:
# Note that the expression used for indexing is returning a series of boolean values.
firmyear["name"] == "Microsoft"

In [None]:
# We can use compound statements that return one boolean value per row.
# Here, it's name == Microsoft or the year is less than 2018.
firmyear[(firmyear["name"] == "Microsoft") | (firmyear["year"] < 2018)]

In [None]:
# Series have methods for checking whether they're NA.
# This is True for each row that is not NA.
firmyear[firmyear["size_emp_change"].notna()]

In [None]:
# This is True for each row that is NA.
firmyear[firmyear["size_emp_change"].isna()]

In [None]:
# We can also use ~ to negate the condition after it.
# So, this selects the rows that are not "not NA" (the same as is NA).
firmyear[~firmyear["size_emp_change"].notna()]
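
The row conditions and column lists above can also be combined in one step. The sketch below uses `.loc`, which was not shown above, on toy data with invented values. It also uses `&` for "and", the counterpart of the `|` we used for "or". The parentheses around each comparison are needed because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Microsoft", "Microsoft", "Google"],
        "year": [2017, 2018, 2018],
        "size_emp": [124000, 131000, 98771],
    }
)

# One boolean value per row: Microsoft and 2018 or later.
mask = (toy["name"] == "Microsoft") & (toy["year"] >= 2018)

# .loc takes the rows first and the columns second.
subset = toy.loc[mask, ["name", "size_emp"]]
print(subset)
```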

# Merging

Like other software, `pandas` is great at merging data, and it has some conveniences not found in most other software.

Let's work through a simple example to see it in action.

In [None]:
# Remember our firm year data.
firmyear.head()

In [None]:
stock = pd.read_csv("../data/stock.csv")
stock.head()

What we'd like to do is merge in those Microsoft stock prices from the beginning of those years.
It's a bit contrived for an example, but it mirrors a lot of real world work.

While we know that Microsoft's ticker is MSFT, there's no way for `pandas` to know that without help.
So, to help, we'll make a lookup table using a dictionary.

In [None]:
lookup = {"Microsoft": "MSFT", "Google": "GOOG"}

In [None]:
firmyear["id_ticker"] = firmyear["name"].map(lookup)
firmyear.head()

In [None]:
# Let's make that lowercase.
firmyear["id_ticker"] = firmyear["id_ticker"].str.lower()
firmyear.head()
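
One thing to watch for with `map`: any name that is missing from the dictionary becomes `NaN` rather than raising an exception. Here is a small sketch, where "Apple" is deliberately left out of the lookup table to show the behavior.

```python
import pandas as pd

lookup = {"Microsoft": "MSFT", "Google": "GOOG"}

# "Apple" is not in the lookup table.
names = pd.Series(["Microsoft", "Google", "Apple"])

# The .str methods skip missing values, so the NaN passes through.
tickers = names.map(lookup).str.lower()
print(tickers)

# A quick check for names that did not get a ticker.
print(names[tickers.isna()])
```

Rows with a missing ticker will simply not find a match in a later left merge, so a check like the last line can save some confusion.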

In Stata, we would have another problem, namely that our column names for merging do not match.
With `pandas`, that's not a problem.

Note the `validate` parameter. This tells pandas that we have an expectation about how these data align with each other, and it should raise an exception if our expectation isn't met.
If you merge data without this parameter, and the result unexpectedly grows in length, you may be unintentionally doing a many-to-many merge (which generally returns a new row for every pair of matches within the groups specified).

In [None]:
firmyear = firmyear.merge(
    stock,
    how="left",
    left_on=["id_ticker", "year"],
    right_on=["tic", "yr"],
    validate="1:1",
)

In [None]:
firmyear.head()
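
To see `validate` in action, here is a sketch with toy data (invented prices) where the right-hand table has a duplicated key. With `validate="1:1"`, pandas raises a `MergeError`. Without it, the merge quietly returns an extra row.

```python
import pandas as pd

left = pd.DataFrame({"id_ticker": ["msft", "msft"], "year": [2017, 2018]})

# The (msft, 2018) key appears twice on the right.
right = pd.DataFrame(
    {
        "tic": ["msft", "msft", "msft"],
        "yr": [2017, 2018, 2018],
        "price": [62.0, 85.0, 86.0],
    }
)

keys = {"left_on": ["id_ticker", "year"], "right_on": ["tic", "yr"]}

try:
    left.merge(right, how="left", validate="1:1", **keys)
    outcome = "merged"
except pd.errors.MergeError as err:
    outcome = f"MergeError: {err}"
print(outcome)

# Without validate, we get three rows from a two-row left table.
unchecked = left.merge(right, how="left", **keys)
print(len(left), len(unchecked))
```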

# Querying

When working with content data, we often need to do some sort of a query to aggregate data that is interesting to us.

For example, let's add an average word count of articles from some NYT data (similar to what we'll retrieve later) to our firmyear data.
We're only going to have results for 2018, as that's all the data I included.

In [None]:
msft_nyt = pd.read_csv("../data/msft_nyt.csv", index_col=False)

In [None]:
msft_nyt["pub_date"] = pd.to_datetime(msft_nyt["pub_date"])
msft_nyt.head()

In [None]:
_AGG = {"word_count": ["mean", "sum"]}


def query_docs(data, ticker, year):
    summary = (
        data[(data["id_ticker"] == ticker) & (data["pub_date"].dt.year == year)]
        .agg(_AGG)
        .T.reset_index(drop=True)
    )
    summary["id_ticker"] = ticker
    summary["year"] = year
    summary = summary.rename(columns={"mean": "wc_mean", "sum": "wc_sum"})
    return summary

In [None]:
result_list = []
for index, row in firmyear[["id_ticker", "year"]].iterrows():
    result_list.append(query_docs(msft_nyt, row["id_ticker"], row["year"]))
results = pd.concat(result_list)

In [None]:
# Here's the same operation condensed into a list comprehension.
# results = pd.concat([query_docs(msft_nyt, row['id_ticker'], row['year'])
#                      for i, row in firmyear[['id_ticker', 'year']].iterrows()])

In [None]:
results

In [None]:
firmyear = firmyear.merge(results, how="left", on=["id_ticker", "year"], validate="1:1")
firmyear.head()
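
As an alternative to looping over rows, the same kind of summary can often be computed in one pass with `groupby`. This is a sketch on toy data, assuming the documents have the `id_ticker`, `pub_date`, and `word_count` columns that `query_docs` uses. It is not an exact replacement: it only returns ticker-years that have at least one document, so after a left merge the missing ones would be `NaN` in both columns.

```python
import pandas as pd

# Toy stand-in for the NYT data; the values are invented.
docs = pd.DataFrame(
    {
        "id_ticker": ["msft", "msft", "msft"],
        "pub_date": pd.to_datetime(["2018-01-05", "2018-06-01", "2017-03-01"]),
        "word_count": [100, 300, 50],
    }
)

summary = (
    docs.assign(year=docs["pub_date"].dt.year)
    .groupby(["id_ticker", "year"], as_index=False)
    # Named aggregation: new_column=(source_column, function).
    .agg(wc_mean=("word_count", "mean"), wc_sum=("word_count", "sum"))
)
print(summary)
```

The result has one row per ticker-year, so it can be merged on `["id_ticker", "year"]` just like `results` above.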

# Saving and exporting

pandas is able to write data in a number of formats that you may need.
You can see a [reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) in the user guide.

Two in particular merit an additional mention:

1. **Parquet.** Apache Parquet is a high-performance compressed file format that I like to use for data that I want to use again in Python. It retains the type information, and it continues to work well up to file sizes of a few GBs.
1. **SQL.** If you are working with a database directly (including the ones we will see later), the SQL support in pandas is really convenient. That said, if you are using a service with its own package (e.g., WRDS), you probably want the more specific package.
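
As a sketch of the SQL support, here is a round trip through an in-memory SQLite database using the `sqlite3` module from the standard library. The table name and data are made up for illustration. Parquet works in a similar spirit with `df.to_parquet()` and `pd.read_parquet()`, but it needs an extra package such as `pyarrow`, so it is not shown in the runnable code.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["Microsoft", "Google"],
        "year": [2018, 2018],
        "size_emp": [131000, 98771],
    }
)

# An in-memory database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Write the dataframe to a table, leaving out the pandas index.
toy.to_sql("firmyear", conn, index=False)

# Let the database do the filtering and read back the result.
big = pd.read_sql(
    "SELECT name, size_emp FROM firmyear WHERE size_emp > 100000", conn
)
conn.close()
print(big)
```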

# Breakout Exercises

Time permitting, try out some of the data handling techniques that we learned above on your own data.

## EX1: try your data

Let's use pandas on a dataset you already have.

1. Read your dataset into a pandas dataframe with the name `my_data`. To find the proper function, you may want to look at the [pandas IO reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
1. Display the first 10 rows.
1. Display the datatypes of the columns. Notice any problems.
1. Try some of the skills we learned above. For example, you might rename a column or select the data where a certain column takes a value (or satisfies some condition).

In [None]:
# 1-1 code

In [None]:
# 1-2 code

In [None]:
# 1-3 code

In [None]:
# 1-4 code

# Bonus content

One thing to notice in our code above is that we have several datasets all in memory at once.
In some stats packages, this is not nearly so easy.

For example, Stata only recently added the concept of multiple datasets, and the interface is much more difficult to use.
In contrast, with pandas, we simply use the name of the dataframe and then whatever operation that we are doing.

In [None]:
firmyear.head()

In [None]:
stock.head()

In [None]:
msft_nyt.head()