# Preliminaries

```python
import pandas as pd
import numpy as np
```

The "object" data type (for the ``industry`` and ``ind_code`` columns) is a catch-all term for when Pandas cannot determine the exact data type of that column (e.g. int, float, str, etc.). Columns containing strings will often have this data type.

# Missing values

Missing values appear as a special code that depends on the data type of the column in which they appear: ``NaN`` (which stands for "not a number") for numeric data types, ``None`` or ``NaN`` for the object data type, and ``NaT`` for "datetime" columns (more on this data type later).

To find the missing values in the data, we can use the ``.isnull()`` function (or its equivalent, ``.isna()``):
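As a minimal sketch, assume a small hypothetical dataframe that stands in for the one used in this section (the values are made up for illustration). ``.isnull()`` returns a dataframe of booleans, and summing it gives the number of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataframe used in this section
df = pd.DataFrame({
    "firmid": [1, 2, 3, 4],
    "industry": ["Finance", None, "Tech", None],
    "ind_code": [1.0, np.nan, 3.0, np.nan],
})

# Boolean dataframe: True wherever a value is missing
missing_mask = df.isnull()

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```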

We can drop all the rows that have any missing values using the ``.dropna()`` function:
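A minimal sketch, again using a hypothetical dataframe with made-up values. Note that ``.dropna()`` returns a new dataframe and leaves the original untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 3 are partially
# missing, and row 2 is missing everywhere
df = pd.DataFrame({
    "firmid": [1.0, 2.0, np.nan, 4.0],
    "industry": ["Finance", None, None, "Tech"],
    "ind_code": [1.0, 2.0, np.nan, np.nan],
})

# Keep only the rows with no missing values at all
complete_rows = df.dropna()
print(complete_rows)
```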

If we want to drop only the rows in which **all** values are missing, we have to use ``how='all'`` as a parameter:
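A minimal sketch on hypothetical data, where only one row is missing in every column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: only row 2 is missing in every column
df = pd.DataFrame({
    "firmid": [1.0, 2.0, np.nan, 4.0],
    "industry": ["Finance", None, None, "Tech"],
    "ind_code": [1.0, 2.0, np.nan, np.nan],
})

# Drop a row only if all of its values are missing
not_all_missing = df.dropna(how="all")
print(not_all_missing)
```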

If we want to remove all rows that contain missing values in a given column, we can use ``.loc[]`` and the ``.notnull()`` function:
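A minimal sketch on hypothetical data, keeping only the rows where ``industry`` is not missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "industry" is missing in rows 1 and 2
df = pd.DataFrame({
    "firmid": [1.0, 2.0, np.nan, 4.0],
    "industry": ["Finance", None, None, "Tech"],
    "ind_code": [1.0, 2.0, np.nan, np.nan],
})

# Boolean mask selects the rows where "industry" is present
has_industry = df.loc[df["industry"].notnull()]
print(has_industry)
```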

Alternatively, we can use the ``subset`` parameter of the ``dropna`` function, which tells the function to look for missing values only in a subset of the columns:
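A minimal sketch on hypothetical data. The result matches what the ``.loc[]`` and ``.notnull()`` approach gives for the same column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "ind_code" is missing in rows 2 and 3
df = pd.DataFrame({
    "firmid": [1.0, 2.0, np.nan, 4.0],
    "industry": ["Finance", None, None, "Tech"],
    "ind_code": [1.0, 2.0, np.nan, np.nan],
})

# Only look at "ind_code" when deciding which rows to drop
has_code = df.dropna(subset=["ind_code"])

# Same rows as the boolean-mask approach
same_as_loc = has_code.equals(df.loc[df["ind_code"].notnull()])
print(has_code)
```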

# Changing data types

Many times, a particular column in our dataframe does not have the datatype we want. There are several functions that allow us to convert one datatype to another. Below, we cover the most commonly used ones:

## ``.astype()``

Specify the new datatype that you want to convert to as an argument to ``.astype()``:
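A minimal sketch, assuming a hypothetical numeric ``firmid`` column and assuming the target type is Pandas' ``"string"`` type (passing ``str`` instead would give an ``object`` column of Python strings):

```python
import pandas as pd

# Hypothetical numeric identifiers
df = pd.DataFrame({"firmid": [101, 102, 103]})

# Convert the column and assign it back; .astype() returns a new Series
df["firmid"] = df["firmid"].astype("string")
print(df.dtypes)
```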

It may not look like ``firmid`` is a string data type now, but it is. For example, the command below would not work if ``firmid`` were still numeric:
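One way to see this, sketched here with a hypothetical ``firmid`` column, is to use the ``.str`` accessor, which is only available for string data. Slicing is used purely as an illustrative string-only operation:

```python
import pandas as pd

# Hypothetical numeric identifiers
df = pd.DataFrame({"firmid": [101, 102, 103]})

# On a numeric column, the .str accessor raises an AttributeError
try:
    df["firmid"].str[:2]
except AttributeError as err:
    print("numeric column:", err)

# After the conversion, the same operation works
df["firmid"] = df["firmid"].astype("string")
prefix = df["firmid"].str[:2]
print(prefix)
```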

## ``.to_numeric()``

This is commonly used to convert string (or object) data types to a numeric data type. Unlike ``.astype()``, which is applied after the name of the column we want to convert, ``.to_numeric()`` is a top-level Pandas function (``pd.to_numeric()``), so you have to supply the column you want to convert as an argument:
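A minimal sketch, assuming a hypothetical ``firmid`` column that stores numbers as text:

```python
import pandas as pd

# Hypothetical identifiers stored as strings
df = pd.DataFrame({"firmid": ["101", "102", "103"]})

# The column is passed as an argument, and the result is assigned back
df["firmid"] = pd.to_numeric(df["firmid"])
print(df.dtypes)
```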

In some situations, the ``.to_numeric()`` function will not be successful unless you specify the parameter ``errors='coerce'``. For example, the code below would not work without that parameter (which is why I always specify it):

Note that this converted the non-numeric values in the ``ind_code`` column to ``NaN``:
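A minimal sketch of this behaviour, using hypothetical ``ind_code`` values in which one entry is not a number:

```python
import pandas as pd

# Hypothetical codes stored as text; "n/a" cannot be parsed as a number
df = pd.DataFrame({"ind_code": ["10", "20", "n/a", "40"]})

# Without errors='coerce', the unparseable value raises a ValueError
try:
    pd.to_numeric(df["ind_code"])
except ValueError as err:
    print("without errors='coerce':", err)

# With errors='coerce', unparseable values become NaN instead
df["ind_code"] = pd.to_numeric(df["ind_code"], errors="coerce")
print(df)
```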

# Duplicates and counts

In many situations, it is important to know if our data contains any duplicate entries (most of the time we want to eliminate those), as well as to count how many times each entry appears in a particular column (or set of columns) of our data. We can perform these operations with the ``.duplicated()`` and ``.value_counts()`` functions.

## ``.duplicated()`` and ``.drop_duplicates()``

Syntax:
```python
DataFrame.duplicated(subset=None, keep='first')
```

where the ``subset`` parameter allows us to specify where in the dataset (which columns) we are looking for duplicated rows (if unspecified, Pandas will look for instances where an entire row is duplicated). The ``keep`` parameter allows us to specify which of the duplicated rows to keep (if any).

To drop duplicated data, we can use the ``.duplicated()`` function inside a ``.loc[]``:
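A minimal sketch, assuming a hypothetical dataframe in which the 5th row repeats the 4th. ``.duplicated()`` flags the repeated row, and negating the flags with ``~`` keeps everything else:

```python
import pandas as pd

# Hypothetical data: the last row duplicates the one before it
df = pd.DataFrame({
    "firmid": [1, 2, 3, 4, 4],
    "industry": ["Finance", "Tech", "Retail", "Energy", "Energy"],
})

# True for rows that repeat an earlier row (keep='first' by default)
is_duplicate = df.duplicated()

# Keep only the rows that are not flagged as duplicates
deduplicated = df.loc[~is_duplicate]
print(deduplicated)
```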

or, more commonly, using the ``.drop_duplicates()`` function:
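A minimal sketch on hypothetical data in which the 5th row repeats the 4th:

```python
import pandas as pd

# Hypothetical data: the last row duplicates the one before it
df = pd.DataFrame({
    "firmid": [1, 2, 3, 4, 4],
    "industry": ["Finance", "Tech", "Retail", "Energy", "Energy"],
})

# Drop repeated rows, keeping the first occurrence of each
deduplicated = df.drop_duplicates()
print(deduplicated)
```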

Note that the above still keeps the 4th row, and drops the 5th (a duplicate of the 4th). This is because ``keep='first'`` by default for the ``.drop_duplicates()`` function. To eliminate both duplicated rows, we would have to set ``keep=False``:
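A minimal sketch on hypothetical data in which the 5th row repeats the 4th. With ``keep=False``, both copies are removed:

```python
import pandas as pd

# Hypothetical data: the last row duplicates the one before it
df = pd.DataFrame({
    "firmid": [1, 2, 3, 4, 4],
    "industry": ["Finance", "Tech", "Retail", "Energy", "Energy"],
})

# keep=False drops every row that has a duplicate, including the first copy
no_duplicated_rows = df.drop_duplicates(keep=False)
print(no_duplicated_rows)
```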

Note also that the meaning of "first" and "last" for the ``keep`` parameter depends on how your dataframe happens to be sorted at the time you drop the duplicates.

## ``.value_counts()``

This finds all the unique values in a column and counts the number of times they appear in that column.

Syntax:
```python
DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)
```
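A minimal sketch, assuming a hypothetical ``industry`` column. It uses the single-column form, ``Series.value_counts()``, which accepts the same ``normalize``, ``sort``, ``ascending`` and ``dropna`` parameters:

```python
import pandas as pd

# Hypothetical industry labels
df = pd.DataFrame({
    "industry": ["Finance", "Tech", "Finance", "Retail", "Finance", "Tech"],
})

# Number of times each value appears, most frequent first
counts = df["industry"].value_counts()

# Same counts expressed as shares of the total
shares = df["industry"].value_counts(normalize=True)
print(counts)
print(shares)
```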

# Operating on text data (strings)

Working with text data is a huge topic in data analysis. The Pandas user guide offers a detailed discussion of the ways the Pandas package can be used to operate on text data: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#. For the most part, all of this is done with the ``.str`` accessor and its methods.

Here, we cover a very small subset of the functions that are commonly used for string manipulation inside a dataframe.

We'll work on the ``df`` dataframe:

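It is good practice to convert a text column to ``string`` type before we manipulate it with ``.str`` functions (the ``.str`` accessor also works on ``object`` columns that hold strings, but the dedicated type makes the intent explicit). For example, the ``industry`` column is currently of type ``object`` so we will convert it to ``string``: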
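A minimal sketch, assuming a hypothetical ``industry`` column with made-up values. Missing entries remain missing after the conversion:

```python
import pandas as pd

# Hypothetical industry labels, one of them missing
df = pd.DataFrame({"industry": ["Finance", "Tech", None]})

# Convert the column to the dedicated string type
df["industry"] = df["industry"].astype("string")
print(df.dtypes)
```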

## Slicing into string data
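A minimal sketch of slicing with the ``.str`` accessor, using a hypothetical series of industry names. The slice is applied to every string in the column, with the same syntax as for a regular Python string:

```python
import pandas as pd

# Hypothetical industry names
industry = pd.Series(["Finance", "Technology", "Retail"], dtype="string")

# First three characters of each string
first_three = industry.str[:3]

# Last two characters of each string
last_two = industry.str[-2:]
print(first_three)
print(last_two)
```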

## Converting to lower case or upper case
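A minimal sketch using ``.str.lower()`` and ``.str.upper()`` on a hypothetical series of industry names:

```python
import pandas as pd

# Hypothetical industry names
industry = pd.Series(["Finance", "Technology", "Retail"], dtype="string")

# Convert every string to lower case or upper case
lower = industry.str.lower()
upper = industry.str.upper()
print(lower)
print(upper)
```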

## Substrings
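A minimal sketch of two common substring operations, assuming a hypothetical series of industry names: ``.str.contains()`` tests whether each string contains a given substring, and ``.str.replace()`` substitutes one substring for another. The choice of these two methods is an assumption about what this subsection covers:

```python
import pandas as pd

# Hypothetical industry names
industry = pd.Series(["Finance", "Technology", "Retail"], dtype="string")

# True where the string contains "tech", ignoring case
has_tech = industry.str.contains("tech", case=False)

# Replace a literal substring (regex=False treats the pattern as plain text)
shortened = industry.str.replace("Technology", "Tech", regex=False)
print(has_tech)
print(shortened)
```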

## Splitting
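A minimal sketch using ``.str.split()``, assuming hypothetical labels that combine two pieces of information separated by a hyphen. With ``expand=True``, each piece goes into its own column:

```python
import pandas as pd

# Hypothetical labels of the form "sector-subsector"
industry = pd.Series(["Finance-Banking", "Tech-Software"], dtype="string")

# Split on the hyphen and spread the pieces across columns 0 and 1
parts = industry.str.split("-", expand=True)
print(parts)
```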

## Stripping white spaces
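A minimal sketch using ``.str.strip()`` (both ends) and ``.str.lstrip()`` (left end only) on a hypothetical series with stray spaces. ``.str.rstrip()`` works the same way for the right end:

```python
import pandas as pd

# Hypothetical industry names with leading and trailing spaces
industry = pd.Series(["  Finance", "Tech  ", " Retail "], dtype="string")

# Remove white space from both ends of each string
stripped = industry.str.strip()

# Remove white space from the left end only
left_stripped = industry.str.lstrip()
print(stripped)
```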

## Chaining ``.str`` methods
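A minimal sketch of chaining, assuming hypothetical labels that need several cleaning steps. Each ``.str`` method returns a new series, so the accessor has to be repeated before every step:

```python
import pandas as pd

# Hypothetical labels with stray spaces and inconsistent capitalization
industry = pd.Series(["  Finance-Banking ", " tech-software"], dtype="string")

# Strip spaces, lower-case, split on the hyphen, then keep the first piece
sector = industry.str.strip().str.lower().str.split("-").str[0]
print(sector)
```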