# Data Wrangling

Data wrangling is the art of manipulating a data set to prepare it for further data analysis.

##### Main Data Wrangling Operations

* Selecting columns
* Filtering rows
* Creating new columns
* Aggregating data
* Grouping data for aggregation
* Reordering/sorting data
* Randomly sampling rows

### Understanding the `dplython` package

*The `dplython` package allows you to work in Python using a set of "verbs" that are taken from the `dplyr` R package. The functions in this package will hopefully make your data wrangling much easier and more intuitive.*

`DplyFrame`: A version of the `pandas` data frame that works with the `dplython` functions.

`X`: Will allow you to select columns without needing to use quotation marks.

`head`: Returns the first *n* rows, where *n* is the number you specify.

`select`: Selects columns based on column name (or number).

`sift`: Filters rows based on criteria.

`arrange`: Sorts data by the specified column or columns.

`mutate`: Allows you to create new columns or modify existing columns.

`group_by`: Specifies how data should be grouped (useful for later aggregation).

`summarize`: Aggregates data based on a specified aggregation function. If grouping variables have been specified (using `group_by`), the aggregation will occur within each grouping variable. If not, aggregation will occur across the whole data frame.

`sample_n`: Randomly samples the data frame to return the specified number of rows.

`sample_frac`: Randomly samples the data frame to return the specified fraction of rows (a number between 0 and 1, so 0.1 means 10% of the rows).

**For more information, visit the [`dplython` README](https://github.com/dodger487/dplython)**

### Setup

Working with the `dplython` package is very similar to working with `pandas`, with a few additional functions that you will need to import. Before you can import them, the `dplython` package must be installed.

If you have not installed the `dplython` package, remove the comment (#) and run the command below.

In [None]:
# !pip install dplython

Now we need to import `pandas` and a set of functions from the `dplython` package.

In [None]:
import pandas as pd
from dplython import (DplyFrame, X, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize) 

With `dplython` installed and imported, we can load the Titanic data set that we have been working with over the past several weeks. Note that there is an additional step: casting the data frame as a `DplyFrame`.

In [None]:
# Read in the data frame as usual using pandas
df = pd.read_csv('train.csv')

# Then, cast the data frame to become a DplyFrame using the DplyFrame() function
df = DplyFrame(df)

### Using the Pipe Operator

One of the most powerful things about the `dplython` package is the `>>` ("pipe") operator. This allows you to chain together multiple steps in an easy-to-read way. When reading code, it is helpful to read `>>` as "then" in plain English.

In [None]:
# Start with your data frame, THEN show the first 10 rows
df >> head(10)
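
To make the "then" reading concrete, here is a minimal sketch of how a `>>` pipe can be built in plain Python with operator overloading. The `Step` class and the `take` and `keep_if` helpers are hypothetical names invented for this illustration; they are not part of `dplython`, and the library's actual implementation may differ.

```python
class Step:
    """Wraps a function so it can appear on the right-hand side of >>."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python calls this for `data >> step` when `data` itself
        # does not know how to handle the >> operator.
        return self.func(data)


def take(n):
    """Hypothetical stand-in for head(n): keep the first n items."""
    return Step(lambda rows: rows[:n])


def keep_if(predicate):
    """Hypothetical stand-in for sift(): keep items where predicate is true."""
    return Step(lambda rows: [row for row in rows if predicate(row)])


# Made-up values for illustration only
ages = [22, 38, 26, 35, 54, 2]

# Start with the list, THEN keep ages over 25, THEN keep the first 3
result = ages >> keep_if(lambda age: age > 25) >> take(3)
print(result)  # [38, 26, 35]
```

Each step receives the output of the step to its left, which is why a chain reads from left to right (or top to bottom) in the order the operations happen.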

In [None]:
# Now write a line of code that displays the first 7 rows of the data frame




### Selecting Columns

Use `select` to keep only the columns you name. Prefix each column name with `X.` to refer to it without quotation marks.

In [None]:
# Create a new data frame by starting with the original data frame,
# THEN selecting the PassengerId, Sex, Age, Fare, and Survived columns
new_df = (df >> select(X.PassengerId, X.Sex, X.Age, X.Fare, X.Survived))

In [None]:
# Start with your new data frame, THEN show the first 10 rows
new_df >> head(10)
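
For comparison, the sketch below shows roughly the same column selection in plain `pandas`, where column names are passed as quoted strings in a list. The `toy` data frame holds made-up rows for illustration only; it is not taken from `train.csv`.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.5, 8.0],
})

# Passing a list of column names returns a new data frame
# containing only those columns, in the order listed
subset = toy[["PassengerId", "Fare"]]
print(subset)
```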

In [None]:
# Without creating a new data frame, select only the PassengerId, Pclass, Sex, SibSp, and Embarked columns




### Filtering data

Use `sift` to create a smaller data frame based on criteria. Any rows meeting the criteria will be returned.

As with the `select` function, the `sift` function can use `X` notation to select columns without using quotation marks.

In [None]:
# Start with your new data frame, THEN filter it to only female passengers, THEN show only the first 10 rows
(new_df >>
    sift(X.Sex == 'female') >>
    head(10))

In [None]:
# You can also use multiple criteria in the sift() function, with each new criterion separated by a comma
# Start with your new data frame, THEN filter it to only female passengers older than 30, THEN show only the first 10 rows
(new_df >>
    sift(X.Sex == 'female', X.Age > 30) >>
    head(10))
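
As a point of comparison, here is a sketch of a similar two-condition filter in plain `pandas`. It assumes that comma-separated criteria in `sift` are combined with a logical AND (as in `dplyr`'s `filter`), which `pandas` spells out with the `&` operator. The `toy` data frame is made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Combine conditions with & and wrap each one in parentheses,
# because & binds more tightly than == and >
older_women = toy[(toy["Sex"] == "female") & (toy["Age"] > 30)]
print(older_women)
```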

In [None]:
# Starting with the new data frame, show the first ten rows of male 
# passengers whose fare was less than 10 and who survived




### Creating or modifying columns

In [None]:
# Start with the new data frame, THEN create a new column called FarePlusTen, THEN show only the first 10 rows
(new_df >>
    mutate(FarePlusTen = X.Fare + 10) >>
    head(10))

In [None]:
# You can create multiple columns at once with each new column separated by a comma
(new_df >>
    mutate(FarePlusTen = X.Fare + 10, FareTimesAge = X.Fare * X.Age) >>
    head(10))

In [None]:
# You can also modify existing columns by naming the "new" column the same as the old column
(new_df >>
    mutate(Fare = X.Fare.round()) >>
    head(10))
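
The plain `pandas` method closest in spirit to `mutate` is probably `assign`, sketched below on a made-up `toy` data frame. Like the examples above, it returns a new data frame rather than changing the original.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"Fare": [7.25, 71.5, 8.0], "Age": [22.0, 38.0, 26.0]})

# assign() returns a copy with the new columns added; toy itself is unchanged
with_new = toy.assign(
    FarePlusTen=toy["Fare"] + 10,
    FareTimesAge=toy["Fare"] * toy["Age"],
)
print(with_new)
```

Reusing an existing column name as the keyword (for example `Fare=toy["Fare"].round()`) replaces that column in the returned copy.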

In [None]:
# Create two new columns of your own choosing




### Grouping and aggregating

Aggregation is a way of summarizing data by transforming more granular data into less granular data. However, sometimes you do not want to summarize the entire data frame but, rather, specific groups within the data frame. The `summarize` and `group_by` functions work together to perform these tasks.

In [None]:
# Start with your new data frame, THEN summarize it by taking the mean of the Fare column
new_df >> summarize(MeanFare = X.Fare.mean())

In [None]:
# You can also chain together multiple aggregations with each separated by a comma
(new_df >> 
     summarize(MeanFare = X.Fare.mean().round(),
               SumFare = X.Fare.sum(),
               MedianFare = X.Fare.median(),
               Count = X.Fare.count()))

To split the aggregations by a certain variable, use the `group_by` function before `summarize`.

In [None]:
# Start with the new data frame, THEN group it by Sex, THEN create aggregations of the Fare column
(new_df >> 
     group_by(X.Sex) >>
     summarize(MeanFare = X.Fare.mean().round(),
               SumFare = X.Fare.sum(),
               MedianFare = X.Fare.median(),
               Count = X.Fare.count()))
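
To show what "aggregating within each group" means, the sketch below performs a similar grouped summary in plain `pandas` using `groupby` with named aggregation (available in `pandas` 0.25 and later). The `toy` data frame is made up so that the group means are easy to check by hand.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [8.0, 70.0, 10.0, 12.0],
})

# One output row per value of Sex; each keyword names an output column
# and gives (input column, aggregation function)
summary = toy.groupby("Sex").agg(
    MeanFare=("Fare", "mean"),
    Count=("Fare", "count"),
)
print(summary)
```

Here the female fares (70 and 10) average to 40 and the male fares (8 and 12) average to 10, so the four input rows collapse to two summary rows.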

In [None]:
# Show the average fare for both male/female and died/survived groups (four groups total)




### Sorting by columns

Use the `arrange` function to sort a data frame by one (or more) columns.

In [None]:
# Start with the new data frame, THEN sort it by Age
new_df >> arrange(X.Age)

In [None]:
# Sort by multiple columns (sort by the first column, then break ties by sorting by the second column)
new_df >> arrange(X.Sex, X.Age)

In [None]:
# Sort a numeric column in descending order by putting a negative sign in front of it
new_df >> arrange(-X.Age)
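
The negative-sign trick only makes sense for numeric columns. For comparison, plain `pandas` controls sort direction per column through the `ascending` argument of `sort_values`, which also works for text columns. The sketch below uses a made-up `toy` data frame.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 54.0],
})

# Sort by Sex ascending, then break ties by Age descending
ordered = toy.sort_values(["Sex", "Age"], ascending=[True, False])
print(ordered)
```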

In [None]:
# Try to sort the new data frame using some different criteria




### Randomly sampling rows

Sometimes, it helps to take a random sample of your data. You can do this with either the `sample_n` function (which lets you specify the number of rows to return) or `sample_frac` (which lets you specify the fraction of rows to return, such as 0.02 for 2%).

In [None]:
# Start with the new data frame, THEN randomly return five rows
new_df >> sample_n(5)

In [None]:
# Start with the new data frame, THEN randomly return 2% of all records
new_df >> sample_frac(0.02)
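
Because sampling is random, the cells above will usually return different rows each time they run. As a sketch of how to make a sample repeatable, the plain `pandas` `sample` method accepts a `random_state` seed; whether the `dplython` functions accept a similar argument is not covered here. The `toy` data frame is made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"PassengerId": [1, 2, 3, 4], "Fare": [8.0, 70.0, 10.0, 12.0]})

# Fixed number of rows; the seed makes the draw repeatable
two_rows = toy.sample(n=2, random_state=0)

# Fraction of rows: 0.5 of 4 rows gives 2 rows
half = toy.sample(frac=0.5, random_state=0)

print(len(two_rows), len(half))
```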

In [None]:
# Randomly return 25 rows from the new data frame




In [None]:
# Now randomly return 1.5% of the rows from the new data frame


