# UFO Sightings

#### The objective of this assignment is for you to explain what is happening in each cell in clear, understandable language. 

#### _There is no need to code._ The code is there for you, and it already runs. Your task is only to explain what each line in each cell does.

#### The placeholder cells should describe what happens in the cell below them.

**Example**: The cell below imports `pandas` as a dependency because `pandas` functions will be used throughout the program, such as the Pandas `DataFrame` as well as the `read_csv` function.

In [None]:
import pandas as pd

Line 1:
Line assigns a string to the variable "csv_path". The string is the relative path to the CSV file on disk.
Line 3: 
Line instructs Python to create a dataframe called "ufo_df" using pandas' read_csv function to "read" the file located at the location specified by "csv_path" and interpret the contents.
Line 5:
Line displays the first five rows of the dataframe "ufo_df" (five is the default, since no other number is given in the parentheses).

In [None]:
csv_path = "Resources/ufoSightings.csv"

ufo_df = pd.read_csv(csv_path)

ufo_df.head()
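
As a minimal sketch of what `read_csv` and `head` do, the snippet below parses a small in-memory CSV instead of the real file. The column names and rows are made up for illustration and are not taken from `ufoSightings.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the real data lives in Resources/ufoSightings.csv.
csv_text = """city,state,shape
phoenix,az,light
austin,tx,disk
seattle,wa,triangle
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
sample_df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument, n defaults to 5.
first_two = sample_df.head(2)
print(first_two)
```

Passing a number to `head` is the only change needed to preview more or fewer rows.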

Line 1:
Line instructs Python to count how many non-null entries are in each column of the "ufo_df" dataframe. Returns a series of column labels and corresponding entry counts (numeric).

This is helpful for quickly checking the completeness of the data. If an appreciable amount of data is missing, excluding the incomplete rows from further analysis could simplify the workflow.

In [None]:
ufo_df.count()
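
To illustrate how `count` exposes missing data, here is a small sketch on a made-up dataframe (the values are hypothetical, not rows from the UFO dataset). Subtracting the counts from the total number of rows gives the number of missing entries per column.

```python
import numpy as np
import pandas as pd

# Made-up rows; None and NaN mark missing entries.
sample_df = pd.DataFrame({
    "city": ["phoenix", "austin", "seattle"],
    "state": ["az", None, "wa"],
    "shape": [np.nan, np.nan, "triangle"],
})

# count() reports the non-null entries per column; len() reports the total rows.
non_null_counts = sample_df.count()
missing_counts = len(sample_df) - non_null_counts
print(missing_counts)
```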

Line 1:
Line defines "clean_ufo_df" as a copy of "ufo_df" with rows that contain missing values removed, according to the "how" parameter. Because no axis is specified, dropna works on rows (the default), and since how="any", every row with at least one missing value is excluded.

Any:
For how="any", a row (or a column, if axis=1 is specified) is dropped if even a single value is missing. The benefit of "any" is that the cleaned dataset contains no nulls, which can make analysis simpler and visualizations cleaner. The drawback is that the dataset could shrink drastically, or become much less representative of the intended sample population.

All:
For how="all", a row (or a column, if axis=1 is specified) is dropped only if all of its values are missing. The benefit of "all" is that more data remains in the dataset, which could still be valuable to the analysis, especially if the missing values were not very important to begin with. The drawback is that nulls remain, so calculations on a column with missing values could be less accurate than the same calculations on complete columns.

Line 2:
Line instructs Python to count how many non-null entries are in each column of the "clean_ufo_df" dataframe. Returns a series of column labels and corresponding entry counts (numeric). Because every row with a missing value was dropped, all columns now report the same count.

In [None]:
clean_ufo_df = ufo_df.dropna(how="any")
clean_ufo_df.count()
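
The difference between how="any" and how="all" can be sketched on a tiny made-up dataframe (hypothetical values, not the UFO data):

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is missing one value, row 2 is missing every value.
sample_df = pd.DataFrame({
    "city": ["phoenix", "austin", np.nan],
    "shape": ["light", np.nan, np.nan],
})

# how="any": drop a row if at least one value is missing (rows are the default axis).
any_dropped = sample_df.dropna(how="any")

# how="all": drop a row only if every value in it is missing.
all_dropped = sample_df.dropna(how="all")

print(len(sample_df), len(any_dropped), len(all_dropped))
```

Only the complete row survives how="any", while how="all" also keeps the partially filled row. The original dataframe is left unchanged in both cases, because `dropna` returns a new object.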

Lines 1-11:
Lines create a new list "columns", which contains the listed strings. The list is used by "loc" on line 13 to specify which columns to retain.

Line 13:
Line defines "usa_ufo_df" as a subset of "clean_ufo_df", whose contents are selected with the "loc" indexer. "loc" accepts a row selector and a column selector inside square brackets. The first selector is a boolean condition, so only rows whose value in the "country" column equals "us" are included in "usa_ufo_df". The second selector specifies that only columns whose labels are in the list "columns" are included in "usa_ufo_df".

Line 14:
Line displays the first five rows (the default, since no other number is given in the parentheses) of the filtered dataset "usa_ufo_df".

In [None]:
columns = [
    "datetime",
    "city",
    "state",
    "country",
    "shape",
    "duration (seconds)",
    "duration (hours/min)",
    "comments",
    "date posted"
]

usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", columns]
usa_ufo_df.head()
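
The row filter inside `loc` is easier to follow when the boolean condition is pulled out into its own variable. The sketch below uses made-up rows; the names `is_us` and `keep_columns` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned dataset.
sample_df = pd.DataFrame({
    "city": ["phoenix", "toronto", "austin"],
    "country": ["us", "ca", "us"],
    "shape": ["light", "disk", "oval"],
    "comments": ["bright", "fast", "silent"],
})

keep_columns = ["city", "country", "shape"]

# The comparison yields a boolean Series: True where the row should be kept.
is_us = sample_df["country"] == "us"

# loc[row_selector, column_selector]; .copy() makes the result independent of sample_df.
us_df = sample_df.loc[is_us, keep_columns].copy()
print(us_df)
```

Calling `.copy()` is a design choice for the sketch: it makes later modifications to the subset unambiguous, since they cannot be mistaken for edits to the original dataframe.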

Line 1:
Line defines "state_counts" as the output of value_counts, which counts the occurrences of each unique value in the "state" column of "usa_ufo_df". Returns a series of unique values and their corresponding counts (numeric), sorted from most to least frequent.

Line 2: 
Line calls for the "state_counts" series to be displayed on the screen.

The utility of these steps is that a summary of the column's data can be obtained and reviewed quickly; potential skews, outliers, etc. can be identified immediately. Additionally, the value_counts output series can be stored as a variable and easily referenced later in the workflow.

In [None]:
state_counts = usa_ufo_df["state"].value_counts()
state_counts
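
A small sketch of `value_counts` on a made-up series of state codes (hypothetical values) shows the shape of the result: the unique values become the index, and the counts become the values.

```python
import pandas as pd

# Hypothetical state codes; "ca" appears three times, "wa" twice, "tx" once.
states = pd.Series(["ca", "wa", "ca", "tx", "ca", "wa"], name="state")

# Counts per unique value, sorted from most to least frequent by default.
state_counts = states.value_counts()

# normalize=True returns proportions instead of raw counts.
state_shares = states.value_counts(normalize=True)

print(state_counts)
```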

Line 1:
Line instructs Python to create a dataframe called "state_ufo_counts_df" using pandas' DataFrame constructor, which takes the data stored in the "state_counts" variable. The utility of converting the data to a dataframe is the additional functionality that the previous object (a series) doesn't have. That functionality is needed in the following steps, in which the column label is renamed to be more descriptive, and it allows the data to be manipulated like, or presented alongside, other dataframes.

Line 2:
Line displays the first five rows of the dataset "state_ufo_counts_df".

In [None]:
state_ufo_counts_df = pd.DataFrame(state_counts)
state_ufo_counts_df.head()

Line 1:
Line passes a dict to the columns parameter of the rename function, which changes the existing column label (the dict's key) to the new label (the dict's value), and assigns the result back to "state_ufo_counts_df". This is more user-friendly because the previous label "state" was heading a column of integers that are actually the number of sightings in each state (the state codes themselves became the index when the series was created by value_counts).

Line 3:
Line displays the first five rows of the dataframe under the new column label "Sum of Sightings".

In [None]:
state_ufo_counts_df = state_ufo_counts_df.rename(
    columns={"state": "Sum of Sightings"})
state_ufo_counts_df.head()
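
One caveat, stated as an assumption about the installed library version rather than about this notebook: in pandas 2.0 and later, `value_counts` names its result `count` instead of reusing the column name, so a rename keyed on "state" may leave the label unchanged. A sketch that does not depend on the old label names the series first and then converts it (the data is made up):

```python
import pandas as pd

# Hypothetical state codes.
states = pd.Series(["ca", "wa", "ca"], name="state")
state_counts = states.value_counts()

# Series.rename with a scalar sets the Series name;
# to_frame() then uses that name as the column label.
counts_df = state_counts.rename("Sum of Sightings").to_frame()
print(counts_df)
```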

Line 1:
Line accesses the dtypes property, which displays the dataframe's column labels along with their datatypes (based on the constituent entry values). It's helpful to know the columns' datatypes so additional steps can be planned around the operations (manipulations, interactions) that each datatype supports or prohibits.

In [None]:
usa_ufo_df.dtypes

Line 1:
Line uses the "loc" indexer to select all rows of the "duration (seconds)" column of usa_ufo_df and overwrites them with the same values converted to type "float" by astype. Once the values are numeric, arithmetic operations can be performed on them.

Line 2:
Line accesses the dtypes property, which displays the dataframe's column labels along with their datatypes (based on the constituent entry values), so the conversion can be checked.

In [None]:
usa_ufo_df.loc[:, "duration (seconds)"] = usa_ufo_df["duration (seconds)"].astype("float")
usa_ufo_df.dtypes
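
Depending on the installed pandas version, assigning through `.loc[:, ...]` on a filtered dataframe may raise a `SettingWithCopyWarning`, or may write the converted values into the existing column without changing its dtype. A sketch of an alternative, assuming the goal is only to make the column numeric, works on an explicit copy and replaces the whole column (the values are made up):

```python
import pandas as pd

# Hypothetical durations stored as strings, as they might be after reading a CSV.
sample_df = pd.DataFrame({"duration (seconds)": ["20", "300.5", "60"]})

# Work on an explicit copy so the assignment cannot affect the source dataframe.
converted_df = sample_df.copy()

# Replacing the whole column lets it take the new float dtype.
converted_df["duration (seconds)"] = converted_df["duration (seconds)"].astype("float")

# Arithmetic such as sum() now operates on numbers rather than strings.
total_seconds = converted_df["duration (seconds)"].sum()
print(converted_df.dtypes)
print(total_seconds)
```

If some entries cannot be parsed as numbers, `astype` raises an error; `pd.to_numeric(..., errors="coerce")` is one option that turns such entries into missing values instead.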

_[Replace this with your clear explanation of what happens in the cell below. What is the output and how were we able to accomplish this?]_
Line 2:
Line instructs Python to sum all the float values in the "duration (seconds)" column, which is possible because the datatype was changed from object to float. The output is a single number: the total duration, in seconds, of all US sightings in the dataset.

In [None]:
# Now it is possible to find the sum of seconds
usa_ufo_df["duration (seconds)"].sum()

_[Replace this with your clear explanation of what happens in the cell below. How did we group by two columns, and what are we now able to do as a result? Lastly, explain what does this output tell you.]_
Line 1:
Line defines "grouped_data" as a GroupBy object built from usa_ufo_df, in which rows are grouped by "state" and, within each state, by "city". This is done by passing a list of both column names to groupby. As a result, aggregate functions such as count or sum can be applied to each state/city combination separately.

Line 4:
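Line instructs Python to count the non-null "datetime" entries in each group; each entry represents one recorded UFO sighting. Because rows with missing values were dropped earlier, counting any other column would give the same result. The output is a series, indexed by state and city, that shows how many UFO sightings were recorded in each city.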

In [None]:
grouped_data = usa_ufo_df.groupby(['state', 'city'])

# Hint: If you are counting records, you can use any column and get the same result. Try it.
grouped_data['datetime'].count()
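
The hint about counting any column holds only when no values are missing. The sketch below, on made-up rows with one deliberately missing "shape", contrasts `count` (non-null entries), `size` (all rows), and `nunique` (distinct values) within each state/city group:

```python
import pandas as pd

# Hypothetical sightings; the second Phoenix row repeats a datetime and lacks a shape.
sample_df = pd.DataFrame({
    "state": ["az", "az", "az", "tx"],
    "city": ["phoenix", "phoenix", "tucson", "austin"],
    "datetime": ["1/1/2010 20:00", "1/1/2010 20:00", "3/5/2011 21:30", "7/4/2012 22:00"],
    "shape": ["light", None, "disk", "oval"],
})

grouped = sample_df.groupby(["state", "city"])

datetime_counts = grouped["datetime"].count()     # non-null datetime entries per group
shape_counts = grouped["shape"].count()           # differs where shape is missing
group_sizes = grouped.size()                      # rows per group, nulls included
unique_datetimes = grouped["datetime"].nunique()  # distinct datetime values per group

print(datetime_counts)
```

For the Phoenix group, `count` on "datetime" and `size` both report two rows, `count` on "shape" reports one, and `nunique` on "datetime" reports one because the two rows share a timestamp.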