# Pandas GroupBy

The Pandas [`groupby()`](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/) function offers a simple way to group the rows of a DataFrame by the categorical values in one or more of its columns. In this notebook, we'll read a CSV file containing UFO sightings from around the world, clean the data, and then group it to analyze the values further.

- [Pandas `groupby()` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
- [Article about `groupby()`](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/)

## Objectives:
- Identify and remove rows with null values
- Create a new DataFrame by filtering an existing DataFrame
- Gather a count of occurrences of categorical values in a column
- Convert a column's datatype to numeric
- Group data by categorical values in a column
- Create a new DataFrame using grouped values

#### Import Dependencies

In [1]:
import pandas as pd
import os

#### Load the provided csv from the *Resources* folder

In [2]:
# Create a reference to the desired CSV file
csv_path = os.path.join("..", "Resources","ufoSightings.csv")

# Read the CSV into a Pandas DataFrame
ufo_df = pd.read_csv(csv_path)

# Print the first five rows of data to the screen
ufo_df.head()



Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


#### Remove the rows with missing data

In [4]:
clean_df = ufo_df.dropna()
clean_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
5,10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.595,-82.188889
7,10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/2/1999,41.1175,-73.408333
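
Before dropping rows, it can help to see how many values are actually missing in each column. The sketch below uses a small hypothetical `sample_df` (not the real UFO data) to show one way to do that with `isnull().sum()`, followed by the same `dropna()` call used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ufo_df with a few missing values
sample_df = pd.DataFrame({
    "city": ["san marcos", "lackland afb", "chester"],
    "state": ["tx", "tx", np.nan],
    "country": ["us", np.nan, "gb"],
})

# Count the missing values in each column
null_counts = sample_df.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value
sample_clean = sample_df.dropna()
print(len(sample_df), "rows before,", len(sample_clean), "rows after")
```

Note that `dropna()` with no arguments removes a row if any column is missing; its `subset` parameter limits the check to specific columns when that is too aggressive.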


#### Convert the *duration (seconds)* column's values to numeric

If you encounter a `SettingWithCopyWarning`, you can ignore it for this exercise. In the real world, it is a good idea to review your code when you see this warning. In short, it means that you may not actually be setting the values you think you are, or that you may be setting values on more objects than you intend. [Read more about this warning here.](https://www.dataquest.io/blog/settingwithcopywarning/)

In [8]:
# check types
#clean_df.dtypes

#pd.to_numeric(clean_df['duration (seconds)'])
clean_df['duration (seconds)'] = clean_df['duration (seconds)'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [10]:
clean_df.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)      float64
duration (hours/min)     object
comments                 object
date posted              object
latitude                 object
longitude               float64
dtype: object
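
One common way to avoid the `SettingWithCopyWarning` in a conversion like the one above is to take an explicit `.copy()` before assigning to a column. The sketch below uses a hypothetical `raw_df` rather than the real data. It also shows `pd.to_numeric(..., errors="coerce")`, which may be useful for a column such as *latitude* that is still `object` in the output above; that a malformed value is the cause is an assumption here, not something the notebook confirms.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, one malformed latitude
raw_df = pd.DataFrame({
    "state": ["tx", None, "hi"],
    "duration (seconds)": ["2700", "20", "900"],
    "latitude": ["29.8830556", "53.2", "21.41q"],
})

# .copy() gives an independent DataFrame, so later assignments are unambiguous
tidy_df = raw_df.dropna().copy()

# Straight conversion works when every value is a valid number
tidy_df["duration (seconds)"] = tidy_df["duration (seconds)"].astype(float)

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
tidy_df["latitude"] = pd.to_numeric(tidy_df["latitude"], errors="coerce")

print(tidy_df.dtypes)
```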

#### Filter the data so that only those sightings in the US are in a DataFrame

In [12]:
us_df = clean_df[clean_df['country']=='us']
us_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
5,10/10/1961 19:00,bristol,tn,us,sphere,300.0,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.595,-82.188889
7,10/10/1965 23:45,norwalk,ct,us,disk,1200.0,20 minutes,A bright orange color changing to reddish colo...,10/2/1999,41.1175,-73.408333


#### Count how many sightings have occurred within each state

Store the output of `value_counts()` in a variable so we can use it later in our code

In [15]:
state_counts = us_df['state'].value_counts()
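
As a small illustration of what `value_counts()` returns, the sketch below runs it on a hypothetical Series of state codes. The result is a Series indexed by the distinct values and sorted from most to least frequent; the `normalize=True` option returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical state codes, standing in for us_df['state']
states = pd.Series(["tx", "ca", "tx", "hi", "ca", "tx"], name="state")

# Occurrences of each distinct value, most frequent first
counts = states.value_counts()
print(counts)

# The same tally expressed as a share of all rows
shares = states.value_counts(normalize=True)
print(shares)
```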

___
## GroupBy

#### Use [`GroupBy`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) to aggregate the data according to the values in the "state" column

In [16]:
grouped_us = us_df.groupby(['state'])

In [17]:
grouped_us

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000019339A8C948>
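
The object printed above holds the grouping but has not computed anything yet. The sketch below, on a hypothetical `sample_df`, shows a few ways to look inside a GroupBy object: `ngroups`, `size()`, `get_group()`, and plain iteration. It groups by the string `"state"` rather than the list `['state']` so that each group key is a simple value.

```python
import pandas as pd

# Hypothetical stand-in for us_df
sample_df = pd.DataFrame({
    "state": ["tx", "ca", "tx", "hi"],
    "shape": ["cylinder", "light", "circle", "light"],
    "duration (seconds)": [2700.0, 20.0, 20.0, 900.0],
})

grouped = sample_df.groupby("state")

# Number of distinct groups
print(grouped.ngroups)

# Number of rows in each group
group_sizes = grouped.size()
print(group_sizes)

# Pull out the rows belonging to a single group
tx_rows = grouped.get_group("tx")
print(tx_rows)

# Iterating yields (key, sub-DataFrame) pairs
for state, rows in grouped:
    print(state, len(rows))
```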

#### To see results, we have to apply an aggregation function that tells the GroupBy object how to summarize each group
Here we'll use the `count()` function

In [18]:
grouped_us.count()

state,datetime,city,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
ak,311,311,311,311,311,311,311,311,311,311
al,629,629,629,629,629,629,629,629,629,629
ar,578,578,578,578,578,578,578,578,578,578
az,2362,2362,2362,2362,2362,2362,2362,2362,2362,2362
ca,8683,8683,8683,8683,8683,8683,8683,8683,8683,8683
co,1385,1385,1385,1385,1385,1385,1385,1385,1385,1385
ct,865,865,865,865,865,865,865,865,865,865
dc,7,7,7,7,7,7,7,7,7,7
de,165,165,165,165,165,165,165,165,165,165
fl,3754,3754,3754,3754,3754,3754,3754,3754,3754,3754


#### Return the *duration (seconds)* column with the `mean()` aggregation function

In [25]:
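# Mean duration per state, computed on a single specified column
state_means = grouped_us['duration (seconds)'].mean()
# Other aggregations work the same way:
#grouped_us['duration (seconds)'].sum()
#grouped_us['duration (seconds)'].max()
#grouped_us['duration (seconds)'].min()

# Applied to the whole GroupBy object, mean() is only meaningful for numeric columns;
# recent pandas versions require numeric_only=True to skip the others
#grouped_us.mean(numeric_only=True)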
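
To make the difference between aggregating one column and aggregating the whole GroupBy object concrete, here is a minimal sketch on a hypothetical `sample_df`. It also shows `agg()` with a list of function names, which computes several statistics in one call.

```python
import pandas as pd

# Hypothetical stand-in for us_df, mixing text and numeric columns
sample_df = pd.DataFrame({
    "state": ["tx", "ca", "tx", "hi"],
    "shape": ["cylinder", "light", "circle", "light"],
    "duration (seconds)": [2700.0, 20.0, 20.0, 900.0],
})

grouped = sample_df.groupby("state")

# numeric_only=True restricts the mean to numeric columns, so "shape" is skipped
numeric_means = grouped.mean(numeric_only=True)
print(numeric_means)

# Several aggregations of one column at once
duration_stats = grouped["duration (seconds)"].agg(["mean", "min", "max"])
print(duration_stats)
```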

#### Creating a new DataFrame using both duration and count

Using `state_counts`, which we calculated previously with `value_counts()`, along with `state_means` (the output from our GroupBy object above), we can create a DataFrame that summarizes our data by state.

Notice that `state_counts` holds the same per-state totals that our GroupBy object returned for the `count()` function.

In [29]:
overview_df = pd.DataFrame({"Counts":state_counts, "Means": state_means})
overview_df.head()

Unnamed: 0,Counts,Means
ak,311,4681.231511
al,629,1431.563593
ar,578,115892.983564
az,2362,6542.546401
ca,8683,2863.707413
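
As an alternative sketch, the same kind of summary can be built in a single step with named aggregation, where each keyword argument to `agg()` names an output column and pairs it with a source column and a function. The data below is hypothetical, and the column names `Counts` and `Means` are chosen only to mirror the table above.

```python
import pandas as pd

# Hypothetical stand-in for us_df
sample_df = pd.DataFrame({
    "state": ["tx", "ca", "tx", "hi"],
    "duration (seconds)": [2700.0, 20.0, 20.0, 900.0],
})

# Each keyword is an output column: (source column, aggregation function)
overview = sample_df.groupby("state").agg(
    Counts=("duration (seconds)", "count"),
    Means=("duration (seconds)", "mean"),
)
print(overview)
```

One difference worth keeping in mind: `count` ignores missing values in the source column, whereas `value_counts()` on the state column counts every row with a state.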


___
## GroupBy with multiple columns
#### It is also possible to group a DataFrame by multiple columns
However, this returns an object with a multi-level index, which can be harder to work with

In [33]:
international_grouped = clean_df.groupby(['country', 'state'])
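#international_grouped.count().head(15)
# numeric_only=True limits the mean to numeric columns (required in recent pandas versions)
international_grouped.mean(numeric_only=True).head(15)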

country,state,duration (seconds),longitude
au,al,900.0,144.963425
au,dc,300.0,153.099533
au,nt,180.0,130.836639
au,oh,180.0,138.598633
au,sa,152.5,138.53265
au,wa,225.0,116.879394
au,yt,30.0,152.908936
ca,ab,1869.697183,-111.727571
ca,bc,948.236071,-111.696363
ca,mb,1291.387097,-98.004436
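
Selecting rows from a result with a multi-level index takes slightly different syntax than a flat index. The sketch below, on a hypothetical `sample_df`, shows three common patterns: a full key as a tuple, an outer-level key on its own, and `xs()` for an inner level.

```python
import pandas as pd

# Hypothetical stand-in for clean_df
sample_df = pd.DataFrame({
    "country": ["us", "us", "ca", "ca", "us"],
    "state": ["tx", "tx", "ab", "bc", "wa"],
    "duration (seconds)": [2700.0, 20.0, 300.0, 60.0, 900.0],
})

means = sample_df.groupby(["country", "state"]).mean(numeric_only=True)

# One row, addressed by the full (country, state) key
us_tx_mean = means.loc[("us", "tx"), "duration (seconds)"]
print(us_tx_mean)

# All states for one country: index by the outer level only
us_means = means.loc["us"]
print(us_means)

# All countries that contain a given state code: cross-section on the inner level
tx_any_country = means.xs("tx", level="state")
print(tx_any_country)
```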


#### Converting a GroupBy object into a DataFrame

In [38]:
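# The aggregation already returns a DataFrame; pd.DataFrame() just makes that explicit
new_df = pd.DataFrame(international_grouped.mean(numeric_only=True))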

In [39]:
new_df.reset_index()

Unnamed: 0,country,state,duration (seconds),longitude
0,au,al,900.000000,144.963425
1,au,dc,300.000000,153.099533
2,au,nt,180.000000,130.836639
3,au,oh,180.000000,138.598633
4,au,sa,152.500000,138.532650
...,...,...,...,...
79,us,vt,1042.462598,-72.818134
80,us,wa,15273.474357,-121.719425
81,us,wi,1928.422656,-89.183426
82,us,wv,6791.901826,-80.796210
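
Two points about flattening a grouped result are easy to miss, so here is a short sketch on hypothetical data. First, `reset_index()` returns a new DataFrame rather than modifying the original, so the result needs to be assigned if it will be reused. Second, passing `as_index=False` to `groupby()` keeps the grouping columns as ordinary columns from the start.

```python
import pandas as pd

# Hypothetical stand-in for clean_df
sample_df = pd.DataFrame({
    "country": ["us", "us", "ca", "ca", "us"],
    "state": ["tx", "tx", "ab", "bc", "wa"],
    "duration (seconds)": [2700.0, 20.0, 300.0, 60.0, 900.0],
})

# Option 1: aggregate, then move the index levels back into columns
flat_a = (
    sample_df.groupby(["country", "state"])
    .mean(numeric_only=True)
    .reset_index()
)

# Option 2: ask groupby not to build the multi-level index at all
flat_b = sample_df.groupby(["country", "state"], as_index=False).mean(numeric_only=True)

print(flat_a)
print(flat_b)
```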
