# Pandas GroupBy

The pandas [`groupby()`](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/) function offers a simple way to group the rows of a DataFrame by the categorical values in one or more of its columns. In this notebook, we'll read a CSV containing UFO sightings from around the world, clean that data, and then group it to analyze the values further.

- [Pandas `groupby()` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
- [Article about `groupby()`](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/)

## Objectives:
- Identify and remove rows with null values
- Create a new DataFrame by filtering an existing DataFrame
- Gather a count of occurrences of categorical values in a column
- Convert a column's datatype to numeric
- Group data by categorical values in a column
- Create a new DataFrame using grouped values

#### Import Dependencies

In [None]:
import pandas as pd
import os

#### Load the provided csv from the *Resources* folder

In [None]:
# Create a reference to the desired CSV file
csv_path = os.path.join("..", "Resources","ufoSightings.csv")

# Read the CSV into a Pandas DataFrame
ufo_df = pd.read_csv(csv_path)

# Print the first five rows of data to the screen
ufo_df.head()

#### Remove the rows with missing data

In [None]:
clean_ufo_df = ufo_df.dropna(how="any")
clean_ufo_df.count()
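
Before dropping rows, it can help to see how many values are missing in each column. The sketch below uses a small made-up DataFrame (`sample_df` and its values are hypothetical stand-ins, not the real UFO data) to show what `dropna(how="any")` keeps.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the UFO data
sample_df = pd.DataFrame({
    "state": ["tx", "ca", None, "ny"],
    "duration (seconds)": ["30", "120", "45", np.nan],
})

# Count the missing values in each column before dropping anything
missing_per_column = sample_df.isnull().sum()

# how="any" drops a row if at least one of its values is missing
clean_sample_df = sample_df.dropna(how="any")

print(missing_per_column)
print(clean_sample_df)
```

Only the rows with a value in every column survive, and they keep their original index labels.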

#### Convert the *duration (seconds)* column's values to numeric

If you encounter a `SettingWithCopyWarning`, it is a good idea to review your code. In short, this warning means that you may not actually be setting the values you think you are, or that you may be setting the values of more objects than you intend to. [Here is an article with more information on this warning.](https://www.dataquest.io/blog/settingwithcopywarning/)

In [None]:
clean_ufo_df["duration (seconds)"] = clean_ufo_df["duration (seconds)"].astype(float)
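
One common way to sidestep the warning is to take an explicit `.copy()` of the cleaned data before assigning to it. The sketch below also uses `pd.to_numeric()` with `errors="coerce"`, which is an alternative to `astype(float)` for columns that might contain unparseable text. The data here is made up for illustration; whether the real file contains malformed durations is an assumption we have not checked.

```python
import pandas as pd

# Hypothetical sample; durations arrive as strings, one of them malformed
raw_df = pd.DataFrame({
    "state": ["tx", "ca", "ny", None],
    "duration (seconds)": ["30", "120.5", "2`", "10"],
})

# .copy() gives us an independent DataFrame, so the assignment below
# clearly modifies clean_df and nothing else
clean_df = raw_df.dropna(how="any").copy()

# errors="coerce" turns unparseable strings into NaN instead of raising
clean_df["duration (seconds)"] = pd.to_numeric(
    clean_df["duration (seconds)"], errors="coerce"
)

print(clean_df.dtypes)
print(clean_df)
```

With `astype(float)`, the malformed string in this sample would raise a `ValueError`; with `errors="coerce"` it becomes `NaN`, which we could then drop or inspect.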

#### Filter the data so that only those sightings in the US are in a DataFrame

In [None]:
usa_ufo_df = clean_ufo_df[clean_ufo_df["country"] == "us"]
usa_ufo_df.head()

#### Count how many sightings have occurred within each state

We're storing the output of `value_counts()` in a variable so we can use it later in our code.

In [None]:
state_counts = usa_ufo_df["state"].value_counts()
state_counts.head()
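
As a quick illustration of what `value_counts()` returns, here is a minimal sketch on a hypothetical Series of state codes. The result is a Series indexed by the unique values and sorted from most to least frequent; passing `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical state codes, not taken from the real data
states = pd.Series(["ca", "tx", "ca", "wa", "ca", "tx"], name="state")

# Counts per unique value, most frequent first
counts = states.value_counts()

# Share of all rows per unique value
shares = states.value_counts(normalize=True)

print(counts)
print(shares)
```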

___
## GroupBy

#### Use [`GroupBy`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) in order to aggregate the data according to the values in the "state" column

The `groupby()` function requires the column(s) whose categorical values you'd like to group your data by.

Notice that when we print out the results, we cannot actually view the grouped content yet. The `groupby()` function returns a "GroupBy object", which requires a bit more information before it returns the values.

In [None]:
grouped_usa_df = usa_ufo_df.groupby(['state'])
print(grouped_usa_df)
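
Even without an aggregation, a GroupBy object can be inspected. The sketch below, on a hypothetical four-row DataFrame, shows a few ways to peek inside: `ngroups` for the number of groups, `size()` for the number of rows per group, and `get_group()` to pull out the rows of a single group.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame({
    "state": ["ca", "tx", "ca", "wa"],
    "shape": ["disk", "light", "circle", "disk"],
})

# Grouping by a single column name (a string rather than a list)
grouped = sample_df.groupby("state")

# Number of distinct groups
print(grouped.ngroups)

# Number of rows in each group
print(grouped.size())

# The rows that belong to one specific group
print(grouped.get_group("ca"))
```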

#### To view the grouped data, we have to provide an aggregation function that tells the GroupBy object how to return the values

In this case, we are calling the `count()` function against our groupby object, which will return the number of non-null values in each column for each group.

Once an aggregation function has been provided, the resulting output is a standard DataFrame, but the row indices are now the unique values for whichever column you've grouped by.

In [None]:
grouped_usa_df.count().head(10)
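
Because `count()` only counts non-null values, its numbers can differ from column to column when data is missing. The sketch below uses a hypothetical DataFrame that still contains nulls to contrast `count()` with `size()`, which counts rows regardless of nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data that still contains missing values
sample_df = pd.DataFrame({
    "state": ["ca", "ca", "tx"],
    "shape": ["disk", np.nan, "light"],
    "duration (seconds)": [30.0, 60.0, np.nan],
})

grouped = sample_df.groupby("state")

# Non-null values per column, per group (a DataFrame)
non_null_counts = grouped.count()

# Total rows per group, nulls included (a Series)
row_counts = grouped.size()

print(non_null_counts)
print(row_counts)
```

In our notebook the rows with nulls were already dropped, so every column of the `count()` output shows the same number for a given state.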

#### Return the *duration (seconds)* column with the `mean()` aggregation function

We can specify a particular column, as we would with a regular DataFrame, but we still need to provide a function to inform the groupby object of how it should aggregate the numbers.

Since *duration (seconds)* was converted to numeric values, we can now average them by state.

In [None]:
state_duration = grouped_usa_df["duration (seconds)"].mean()
state_duration.head()
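
If we want more than one statistic for the same column, `agg()` accepts a list of function names. This is a minimal sketch on hypothetical data; the column name matches the notebook, but the values are made up.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame({
    "state": ["ca", "ca", "tx"],
    "duration (seconds)": [30.0, 90.0, 10.0],
})

# One column selected, several aggregations applied at once
duration_stats = (
    sample_df.groupby("state")["duration (seconds)"]
    .agg(["mean", "sum", "max"])
)

print(duration_stats)
```

The result is a DataFrame with one row per state and one column per aggregation.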

#### Creating a new DataFrame using both duration and count

Using `state_counts`, which we calculated previously using `value_counts()`, along with `state_duration` (the output from our groupby object above), we can create a DataFrame that summarizes our data by state.

Notice that `state_counts` holds the same numbers that our groupby object returned for the `count()` function, since every row with missing values was removed earlier.

In [None]:
state_summary_table = pd.DataFrame({"Number of Sightings": state_counts,
                                    "Average Duration (s)": state_duration})
state_summary_table.head()
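
Combining two Series this way works even though `value_counts()` sorts by frequency and the groupby output is sorted by state, because `pd.DataFrame()` lines the Series up by their index labels. The sketch below shows this on hypothetical data, and also shows one possible alternative that builds the same table directly from the groupby object with `agg()` and `rename()`.

```python
import pandas as pd

# Hypothetical sample data: "tx" is more frequent, "ca" sorts first
sample_df = pd.DataFrame({
    "state": ["tx", "ca", "tx", "tx"],
    "duration (seconds)": [10.0, 30.0, 20.0, 30.0],
})

# Two-step approach, as in the notebook
state_counts = sample_df["state"].value_counts()
state_duration = sample_df.groupby("state")["duration (seconds)"].mean()
two_step = pd.DataFrame({"Number of Sightings": state_counts,
                         "Average Duration (s)": state_duration})

# One-step alternative: aggregate twice, then rename the columns
one_step = (
    sample_df.groupby("state")["duration (seconds)"]
    .agg(["count", "mean"])
    .rename(columns={"count": "Number of Sightings",
                     "mean": "Average Duration (s)"})
)

print(two_step)
print(one_step)
```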

___
## GroupBy with multiple columns
#### It is also possible to group a DataFrame by multiple columns
This returns an object with a hierarchical index (a `MultiIndex`), however, which can be harder to work with.

In [None]:
grouped_international_data = clean_ufo_df.groupby(['country', 'state'])

grouped_international_data.count().head(20)
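
To get a feel for working with a multi-level index, here is a minimal sketch on hypothetical data. Passing a single label to `.loc` selects everything under the first level, while passing a tuple selects one specific combination.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame({
    "country": ["us", "us", "ca", "us"],
    "state": ["ca", "tx", "on", "ca"],
    "duration (seconds)": [30.0, 10.0, 20.0, 90.0],
})

# Grouping by two columns produces a two-level index
total_duration = (
    sample_df.groupby(["country", "state"])["duration (seconds)"].sum()
)

# All states within one country (indexed by state only)
us_only = total_duration.loc["us"]

# One specific country/state combination
us_ca_total = total_duration.loc[("us", "ca")]

print(total_duration)
print(us_only)
print(us_ca_total)
```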

#### Converting a GroupBy object into a DataFrame

In the case below, we're specifying an individual column that we want to return from our groupby object, so the result will be a *Series*. As with any *Series*, we can convert it to a *DataFrame* using `pd.DataFrame()`.

In [None]:
international_duration = pd.DataFrame(grouped_international_data["duration (seconds)"].sum())
international_duration.head(10)
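
Besides `pd.DataFrame()`, a Series has its own conversion methods. The sketch below, on hypothetical data, shows `to_frame()`, which keeps the multi-level index, and `reset_index()`, which turns the index levels back into ordinary columns. The column name "Total Duration (s)" is our own choice for illustration.

```python
import pandas as pd

# Hypothetical sample data
sample_df = pd.DataFrame({
    "country": ["us", "us", "ca", "us"],
    "state": ["ca", "tx", "on", "ca"],
    "duration (seconds)": [30.0, 10.0, 20.0, 90.0],
})

total_duration = (
    sample_df.groupby(["country", "state"])["duration (seconds)"].sum()
)

# Keeps country/state as the index; one data column with a new name
as_frame = total_duration.to_frame(name="Total Duration (s)")

# Moves country/state into regular columns with a default integer index
flat = total_duration.reset_index(name="Total Duration (s)")

print(as_frame)
print(flat)
```

The flat version is often easier to filter, sort, or export, since every value lives in a named column.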