# Module 6, Class 3: Manipulating Data

In [3]:
# run this cell to import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

First, we'll start with a messy dataset. These are the results of the Google Form that you just filled out, along with responses from other OTD members.

Keep in mind that the dataset we are using is what we call *structured data*: the values are stored in rows and columns.

In [None]:
data = pd.read_csv('responses.csv')
data.head()

We will use some of the methods we just learned about to clean this data. In particular, we will handle null values, convert data types, drop duplicates, and perform some string manipulation.

## Handling Null Values
First off, we are going to handle null values in our dataset. We can find the null values with the following methods, each of which returns `True` / `False` for every value:
- `df.isna()`
- `df.notna()`

You can call either method on the entire dataframe or on a particular column (for example, `df[column].isna()`). There are a couple of ways to handle the null values once we have found them:
- `fillna(<value>)`
- `dropna()`

We can fill null values with a placeholder such as 'missing' or 0, with the mean of the column, or with some other value. Depending on the data, we might instead want to drop the rows that contain null values.
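As a rough sketch of the options above, the example below builds a small made-up dataframe (the `name` and `age` columns and their values are assumptions for illustration, not the real survey data), counts its null values, and then shows both filling and dropping them.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; the column names and values are made up.
df = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Dee"],
    "age": [25.0, np.nan, 31.0, 40.0],
})

# isna() returns True/False per value; summing counts the nulls per column.
null_counts = df.isna().sum()

# Fill missing names with a placeholder and missing ages with the column mean.
filled = df.fillna({"name": "missing", "age": df["age"].mean()})

# Alternatively, drop every row that contains at least one null value.
dropped = df.dropna()

print(null_counts)
print(filled)
print(dropped)
```

Both `fillna()` and `dropna()` return a new dataframe here, so `df` itself still contains the null values afterwards.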

## Convert Data Types

Say that we have a column called `age`, but the values are being read as strings. If we want to calculate the average or maximum / minimum age, we need the age values as integers. We can do this through type conversion.

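For example, we can use the pandas [astype()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) method to convert our string age column to an integer column. `astype()` returns a new column rather than changing the existing one, so we assign the result back:

`df['age'] = df['age'].astype(int)`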
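Here is a minimal sketch of the conversion on a made-up `age` column (the values are assumptions, not the real responses). It also shows `pd.to_numeric`, which is a different function from `astype()` and may be handy when some entries cannot be parsed as numbers.

```python
import pandas as pd

# Hypothetical ages that were read in as strings.
df = pd.DataFrame({"age": ["25", "31", "40"]})

# Convert the strings to integers and store the result back in the column.
df["age"] = df["age"].astype(int)
average_age = df["age"].mean()

# astype(int) raises an error on values like "thirty".
# pd.to_numeric with errors="coerce" turns unparseable values into NaN instead.
messy = pd.Series(["25", "thirty", "40"])
cleaned = pd.to_numeric(messy, errors="coerce")

print(df["age"].dtype)
print(average_age)
print(cleaned)
```

After coercing, the resulting NaN values can be handled with the null-value methods from the previous section.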

## Drop Duplicates

If someone filled out our form more than once with the exact same responses, we may want to drop the duplicates. Imagine that you are storing your data in a database where you have to pay for each row that you store. In that case, we don't want to store duplicate data that we don't need, since it would be a redundant cost.

Luckily for us, there is a [pandas method to drop duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html). It returns a new dataframe with the duplicate rows removed, so we assign the result back to keep it:

`df = df.drop_duplicates()`
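The sketch below uses a small made-up dataframe (column names and values are assumptions) to show the default behavior, where only fully identical rows count as duplicates, and the `subset` parameter, which compares only the listed columns.

```python
import pandas as pd

# Hypothetical responses: rows 0 and 2 are exact duplicates,
# while row 3 has the same name but a different age.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Ana"],
    "age": [25, 31, 25, 26],
})

# Default: a row is dropped only if every column matches an earlier row.
deduped = df.drop_duplicates()

# Compare on the name column only, keeping the first row for each name.
by_name = df.drop_duplicates(subset="name", keep="first")

print(deduped)
print(by_name)
```

Whether to compare whole rows or only some columns depends on what counts as "the same response" for your data.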

## Summary Statistics

We have learned about the different ways that summary statistics can be used to help us with our analysis. Let's take a look at our survey data again and calculate some summary statistics.
- `df.describe()`
- `df[column].mean()`
- `df[column].min()`
- `df[column].max()`
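As an illustration of the methods listed above, the following sketch computes summary statistics on a made-up dataframe. The `age` and `hours_of_sleep` columns are hypothetical and stand in for whatever numeric columns the survey data contains.

```python
import pandas as pd

# Hypothetical numeric survey columns.
df = pd.DataFrame({
    "age": [25, 31, 40],
    "hours_of_sleep": [7.5, 6.0, 8.0],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = df.describe()

# Individual statistics for a single column.
mean_age = df["age"].mean()
youngest = df["age"].min()
oldest = df["age"].max()

print(summary)
print(mean_age, youngest, oldest)
```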

## String Manipulation

Pandas has a built-in `str` accessor which we can use to manipulate string data. For more complex string data, you can use [Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression). There are free online tools such as [regex101](https://regex101.com/) which are helpful for testing out regex code. However, we'll focus on the simpler built-in pandas methods.

We can use methods such as:
- `df[column].str.replace()`
    - replace a string value in a series
- `df[column].str.contains()`
    - see if a series contains a string or substring
- `df[column].str.lower()`
    - make strings all lowercase
- `df[column].str.upper()`
    - make strings all uppercase
- `df[column].str.len()`
    - calculate the length of a string
- `df[column].str.cat()`
    - `cat` is short for concatenate, which is like adding 2 strings together. This is helpful for cases such as combining first and last name to make full name.
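The methods above can be sketched on a small made-up dataframe. The `first_name`, `last_name`, and `city` columns are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical string columns.
df = pd.DataFrame({
    "first_name": ["Ana", "Ben"],
    "last_name": ["Lopez", "Kim"],
    "city": ["new york, NY", "Boston, MA"],
})

# Make strings all lowercase or all uppercase.
df["city_lower"] = df["city"].str.lower()
df["city_upper"] = df["city"].str.upper()

# Check whether each value contains a substring (regex=False means a literal match).
df["in_ny"] = df["city"].str.contains("NY", regex=False)

# Replace one piece of text with another.
df["city_clean"] = df["city"].str.replace(", ", " - ", regex=False)

# Calculate the length of each string.
df["name_length"] = df["first_name"].str.len()

# Concatenate first and last name, separated by a space.
df["full_name"] = df["first_name"].str.cat(df["last_name"], sep=" ")

print(df)
```

Each `str` method returns a new column, which is why the results are assigned to new column names here.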